Sharing memory and I/O services between nodes

ABSTRACT

A shared memory controller is to service load and store operations received, over data links, from a plurality of independent nodes to provide access to a shared memory resource. Each of the plurality of independent nodes is to be permitted to access a respective portion of the shared memory resource. Interconnect protocol data and memory access protocol data are sent on the data links and transitions between the interconnect protocol data and memory access protocol data can be defined and identified.

CROSS-REFERENCE TO RELATED APPLICATIONS

This Application is a continuation of U.S. application Ser. No. 17/170,619, filed Feb. 8, 2021, which is a continuation of U.S. application Ser. No. 15/039,468, filed May 26, 2016, which is a national stage application under 35 U.S.C. § 371 of PCT Application PCT/US2013/077785, filed on Dec. 26, 2013 and entitled “Sharing Memory and I/O Services Between Nodes”, which is incorporated by reference in its entirety. The disclosures of the prior applications are considered part of and are hereby incorporated by reference in their entirety in the disclosure of this application.

FIELD

This disclosure pertains to computing systems, and in particular (but not exclusively) to memory access between components in a computing system.

BACKGROUND

Advances in semi-conductor processing and logic design have permitted an increase in the amount of logic that may be present on integrated circuit devices. As a corollary, computer system configurations have evolved from a single or multiple integrated circuits in a system to multiple cores, multiple hardware threads, and multiple logical processors present on individual integrated circuits, as well as other interfaces integrated within such processors. A processor or integrated circuit typically comprises a single physical processor die, where the processor die may include any number of cores, hardware threads, logical processors, interfaces, memory, controller hubs, etc.

As a result of the greater ability to fit more processing power in smaller packages, smaller computing devices have increased in popularity. Smartphones, tablets, ultrathin notebooks, and other user equipment have grown exponentially. However, these smaller devices are reliant on servers both for data storage and complex processing that exceeds the form factor. Consequently, the demand in the high-performance computing market (i.e., server space) has also increased. For instance, in modern servers, there is typically not only a single processor with multiple cores, but also multiple physical processors (also referred to as multiple sockets) to increase the computing power. But as the processing power grows along with the number of devices in a computing system, the communication between sockets and other devices becomes more critical.

In fact, interconnects have grown from more traditional multi-drop buses that primarily handled electrical communications to full blown interconnect architectures that facilitate fast communication. Unfortunately, as the demand for future processors to consume at even higher rates grows, corresponding demand is placed on the capabilities of existing interconnect architectures.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates an embodiment of a computing system including an interconnect architecture.

FIG. 2 illustrates an embodiment of an interconnect architecture including a layered stack.

FIG. 3 illustrates an embodiment of a request or packet to be generated or received within an interconnect architecture.

FIG. 4 illustrates an embodiment of a transmitter and receiver pair for an interconnect architecture.

FIG. 5A illustrates a simplified block diagram of an embodiment of an example node.

FIG. 5B illustrates a simplified block diagram of an embodiment of an example system including a plurality of nodes.

FIG. 6 is a representation of data transmitted according to an example shared memory link.

FIG. 7A is a representation of data transmitted according to another example of a shared memory link.

FIG. 7B is a representation of an example start of data framing token.

FIG. 8 is a representation of data transmitted according to another example of a shared memory link.

FIGS. 9A-9D are flowcharts illustrating example techniques for memory access messaging.

FIG. 10 illustrates an embodiment of a block diagram for a computing system including a multicore processor.

FIG. 11 illustrates another embodiment of a block diagram for a computing system including a multicore processor.

FIG. 12 illustrates an embodiment of a block diagram for a processor.

FIG. 13 illustrates another embodiment of a block diagram for a computing system including a processor.

FIG. 14 illustrates an embodiment of a block diagram for a computing system including multiple processors.

FIG. 15 illustrates an example system implemented as a system on chip (SoC).

Like reference numbers and designations in the various drawings indicate like elements.

DETAILED DESCRIPTION

In the following description, numerous specific details are set forth, such as examples of specific types of processors and system configurations, specific hardware structures, specific architectural and microarchitectural details, specific register configurations, specific instruction types, specific system components, specific measurements/heights, specific processor pipeline stages and operation, etc., in order to provide a thorough understanding of the present invention. It will be apparent, however, to one skilled in the art that these specific details need not be employed to practice the present invention. In other instances, well known components or methods, such as specific and alternative processor architectures, specific logic circuits/code for described algorithms, specific firmware code, specific interconnect operation, specific logic configurations, specific manufacturing techniques and materials, specific compiler implementations, specific expression of algorithms in code, specific power down and gating techniques/logic, and other specific operational details of computer systems have not been described in detail in order to avoid unnecessarily obscuring the present invention.

Although the following embodiments may be described with reference to energy conservation and energy efficiency in specific integrated circuits, such as in computing platforms or microprocessors, other embodiments are applicable to other types of integrated circuits and logic devices. Similar techniques and teachings of embodiments described herein may be applied to other types of circuits or semiconductor devices that may also benefit from better energy efficiency and energy conservation. For example, the disclosed embodiments are not limited to desktop computer systems or Ultrabooks™, and may also be used in other devices, such as handheld devices, tablets, other thin notebooks, systems on a chip (SOC) devices, and embedded applications. Some examples of handheld devices include cellular phones, Internet protocol devices, digital cameras, personal digital assistants (PDAs), and handheld PCs. Embedded applications typically include a microcontroller, a digital signal processor (DSP), a system on a chip, network computers (NetPC), set-top boxes, network hubs, wide area network (WAN) switches, or any other system that can perform the functions and operations taught below. Moreover, the apparatuses, methods, and systems described herein are not limited to physical computing devices, but may also relate to software optimizations for energy conservation and efficiency. As will become readily apparent in the description below, the embodiments of methods, apparatuses, and systems described herein (whether in reference to hardware, firmware, software, or a combination thereof) are vital to a ‘green technology’ future balanced with performance considerations.

As computing systems are advancing, the components therein are becoming more complex. As a result, the interconnect architecture to couple and communicate between the components is also increasing in complexity to ensure that bandwidth requirements are met for optimal component operation. Furthermore, different market segments demand different aspects of interconnect architectures to suit the market's needs. For example, servers require higher performance, while the mobile ecosystem is sometimes able to sacrifice overall performance for power savings. Yet, it is a singular purpose of most fabrics to provide the highest possible performance with maximum power saving. Below, a number of interconnects are discussed, which would potentially benefit from aspects of the invention described herein.

One interconnect fabric architecture includes the Peripheral Component Interconnect (PCI) Express (PCIe) architecture. A primary goal of PCIe is to enable components and devices from different vendors to inter-operate in an open architecture, spanning multiple market segments: Clients (Desktops and Mobile), Servers (Standard and Enterprise), and Embedded and Communication devices. PCI Express is a high performance, general purpose I/O interconnect defined for a wide variety of future computing and communication platforms. Some PCI attributes, such as its usage model, load-store architecture, and software interfaces, have been maintained through its revisions, whereas previous parallel bus implementations have been replaced by a highly scalable, fully serial interface. The more recent versions of PCI Express take advantage of advances in point-to-point interconnects, switch-based technology, and packetized protocol to deliver new levels of performance and features. Power Management, Quality of Service (QoS), Hot-Plug/Hot-Swap support, Data Integrity, and Error Handling are among some of the advanced features supported by PCI Express.

Referring to FIG. 1, an embodiment of a fabric composed of point-to-point links that interconnect a set of components is illustrated. System 100 includes processor 105 and system memory 110 coupled to controller hub 115. Processor 105 includes any processing element, such as a microprocessor, a host processor, an embedded processor, a co-processor, or other processor. Processor 105 is coupled to controller hub 115 through front-side bus (FSB) 106. In one embodiment, FSB 106 is a serial point-to-point interconnect as described below. In another embodiment, link 106 includes a serial, differential interconnect architecture that is compliant with a different interconnect standard.

System memory 110 includes any memory device, such as random access memory (RAM), non-volatile (NV) memory, or other memory accessible by devices in system 100. System memory 110 is coupled to controller hub 115 through memory interface 116. Examples of a memory interface include a double-data rate (DDR) memory interface, a dual-channel DDR memory interface, and a dynamic RAM (DRAM) memory interface.

In one embodiment, controller hub 115 is a root hub, root complex, or root controller in a Peripheral Component Interconnect Express (PCIe or PCIE) interconnection hierarchy. Examples of controller hub 115 include a chipset, a memory controller hub (MCH), a northbridge, an interconnect controller hub (ICH), a southbridge, and a root controller/hub. Often the term chipset refers to two physically separate controller hubs, i.e., a memory controller hub (MCH) coupled to an interconnect controller hub (ICH). Note that current systems often include the MCH integrated with processor 105, while controller 115 is to communicate with I/O devices, in a similar manner as described below. In some embodiments, peer-to-peer routing is optionally supported through root complex 115.

Here, controller hub 115 is coupled to switch/bridge 120 through serial link 119. Input/output modules 117 and 121, which may also be referred to as interfaces/ports 117 and 121, include/implement a layered protocol stack to provide communication between controller hub 115 and switch 120. In one embodiment, multiple devices are capable of being coupled to switch 120.

Switch/bridge 120 routes packets/messages from device 125 upstream, i.e., up a hierarchy towards a root complex, to controller hub 115, and downstream, i.e., down a hierarchy away from a root controller, from processor 105 or system memory 110 to device 125. Switch 120, in one embodiment, is referred to as a logical assembly of multiple virtual PCI-to-PCI bridge devices. Device 125 includes any internal or external device or component to be coupled to an electronic system, such as an I/O device, a Network Interface Controller (NIC), an add-in card, an audio processor, a network processor, a hard-drive, a storage device, a CD/DVD ROM, a monitor, a printer, a mouse, a keyboard, a router, a portable storage device, a Firewire device, a Universal Serial Bus (USB) device, a scanner, and other input/output devices. Often, in the PCIe vernacular, such a device is referred to as an endpoint. Although not specifically shown, device 125 may include a PCIe to PCI/PCI-X bridge to support legacy or other-version PCI devices. Endpoint devices in PCIe are often classified as legacy, PCIe, or root complex integrated endpoints.

Graphics accelerator 130 is also coupled to controller hub 115 through serial link 132. In one embodiment, graphics accelerator 130 is coupled to an MCH, which is coupled to an ICH. Switch 120, and accordingly I/O device 125, is then coupled to the ICH. I/O modules 131 and 118 are also to implement a layered protocol stack to communicate between graphics accelerator 130 and controller hub 115. Similar to the MCH discussion above, a graphics controller or the graphics accelerator 130 itself may be integrated in processor 105.

Turning to FIG. 2, an embodiment of a layered protocol stack is illustrated. Layered protocol stack 200 includes any form of a layered communication stack, such as a Quick Path Interconnect (QPI) stack, a PCIe stack, a next generation high performance computing interconnect stack, or other layered stack. Although the discussion immediately below in reference to FIGS. 1-4 is in relation to a PCIe stack, the same concepts may be applied to other interconnect stacks. In one embodiment, protocol stack 200 is a PCIe protocol stack including transaction layer 205, link layer 210, and physical layer 220. An interface, such as interfaces 117, 118, 121, 122, 126, and 131 in FIG. 1, may be represented as communication protocol stack 200. Representation as a communication protocol stack may also be referred to as a module or interface implementing/including a protocol stack.

PCI Express uses packets to communicate information between components. Packets are formed in the Transaction Layer 205 and Data Link Layer 210 to carry the information from the transmitting component to the receiving component. As the transmitted packets flow through the other layers, they are extended with additional information necessary to handle packets at those layers. At the receiving side the reverse process occurs and packets get transformed from their Physical Layer 220 representation to the Data Link Layer 210 representation and finally (for Transaction Layer Packets) to the form that can be processed by the Transaction Layer 205 of the receiving device.

Transaction Layer

In one embodiment, transaction layer 205 is to provide an interface between a device's processing core and the interconnect architecture, such as data link layer 210 and physical layer 220. In this regard, a primary responsibility of the transaction layer 205 is the assembly and disassembly of packets (i.e., transaction layer packets, or TLPs). The transaction layer 205 typically manages credit-based flow control for TLPs. PCIe implements split transactions, i.e., transactions with request and response separated by time, allowing a link to carry other traffic while the target device gathers data for the response.

In addition, PCIe utilizes credit-based flow control. In this scheme, a device advertises an initial amount of credit for each of the receive buffers in Transaction Layer 205. An external device at the opposite end of the link, such as controller hub 115 in FIG. 1, counts the number of credits consumed by each TLP. A transaction may be transmitted if the transaction does not exceed a credit limit. Upon receiving a response, an amount of credit is restored. An advantage of a credit scheme is that the latency of credit return does not affect performance, provided that the credit limit is not encountered.
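
The credit accounting described above can be pictured with a short sketch. The following C fragment is a simplified, hypothetical model rather than PCIe implementation code; real devices track credits separately per credit type (posted, non-posted, completion) and per header/data pool, and the structure and function names here are illustrative only.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-link credit state for one credit type (e.g., posted data). */
struct credit_state {
    uint32_t advertised;  /* credits advertised by the receiver at init */
    uint32_t consumed;    /* credits consumed by TLPs sent so far       */
    uint32_t returned;    /* credits returned by the receiver           */
};

/* A TLP may be transmitted only if it would not exceed the credit limit. */
static bool can_transmit(const struct credit_state *cs, uint32_t tlp_credits)
{
    uint32_t outstanding = cs->consumed - cs->returned;   /* credits in flight */
    return outstanding + tlp_credits <= cs->advertised;   /* stay under limit  */
}

static void on_transmit(struct credit_state *cs, uint32_t tlp_credits)
{
    cs->consumed += tlp_credits;
}

/* Called when the receiver signals that buffer space has been freed. */
static void on_credit_return(struct credit_state *cs, uint32_t credits)
{
    cs->returned += credits;
}
```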

In one embodiment, four transaction address spaces include a configuration address space, a memory address space, an input/output address space, and a message address space. Memory space transactions include one or more of read requests and write requests to transfer data to/from a memory-mapped location. In one embodiment, memory space transactions are capable of using two different address formats, e.g., a short address format, such as a 32-bit address, or a long address format, such as a 64-bit address. Configuration space transactions are used to access the configuration space of the PCIe devices. Transactions to the configuration space include read requests and write requests. Message space transactions (or, simply messages) are defined to support in-band communication between PCIe agents.

Therefore, in one embodiment, transaction layer 205 assembles packet header/payload 206. Format for current packet headers/payloads may be found in the PCIe specification at the PCIe specification website.

Quickly referring to FIG. 3, an embodiment of a PCIe transaction descriptor is illustrated. In one embodiment, transaction descriptor 300 is a mechanism for carrying transaction information. In this regard, transaction descriptor 300 supports identification of transactions in a system. Other potential uses include tracking modifications of default transaction ordering and association of transactions with channels.

Transaction descriptor 300 includes global identifier field 302, attributes field 304, and channel identifier field 306. In the illustrated example, global identifier field 302 is depicted comprising local transaction identifier field 308 and source identifier field 310. In one embodiment, global transaction identifier 302 is unique for all outstanding requests.

According to one implementation, local transaction identifier field 308 is a field generated by a requesting agent, and it is unique for all outstanding requests that require a completion for that requesting agent. Furthermore, in this example, source identifier 310 uniquely identifies the requestor agent within a PCIe hierarchy. Accordingly, together with source ID 310, local transaction identifier field 308 provides global identification of a transaction within a hierarchy domain.

Attributes field 304 specifies characteristics and relationships of the transaction. In this regard, attributes field 304 is potentially used to provide additional information that allows modification of the default handling of transactions. In one embodiment, attributes field 304 includes priority field 312, reserved field 314, ordering field 316, and no-snoop field 318. Here, priority sub-field 312 may be modified by an initiator to assign a priority to the transaction. Reserved attribute field 314 is left reserved for future, or vendor-defined, usage. Possible usage models using priority or security attributes may be implemented using the reserved attribute field.

In this example, ordering attribute field 316 is used to supply optional information conveying the type of ordering that may modify default ordering rules. According to one example implementation, an ordering attribute of “0” denotes default ordering rules are to apply, wherein an ordering attribute of “1” denotes relaxed ordering, wherein writes can pass writes in the same direction, and read completions can pass writes in the same direction. Snoop attribute field 318 is utilized to determine if transactions are snooped. As shown, channel ID field 306 identifies a channel that a transaction is associated with.
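
For illustration, the descriptor fields discussed with respect to FIG. 3 can be gathered into a structure. The sketch below is a hypothetical representation; the field widths and layout are assumptions made for readability and do not reproduce the exact bit positions defined by the PCIe specification.

```c
#include <stdint.h>

/* Illustrative transaction descriptor; field widths are assumptions for this sketch. */
struct transaction_descriptor {
    /* Global identifier: unique for all outstanding requests in a hierarchy domain. */
    uint16_t local_txn_id;   /* generated by the requesting agent                    */
    uint16_t source_id;      /* uniquely identifies the requestor agent              */

    /* Attributes: modify default transaction handling. */
    uint8_t  priority;       /* may be set by the initiator                          */
    uint8_t  reserved;       /* reserved for future or vendor-defined usage          */
    uint8_t  ordering;       /* 0 = default ordering, 1 = relaxed ordering           */
    uint8_t  no_snoop;       /* nonzero = transaction is not snooped                 */

    /* Channel the transaction is associated with. */
    uint8_t  channel_id;
};
```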

Link Layer

Link layer 210, also referred to as data link layer 210, acts as an intermediate stage between transaction layer 205 and the physical layer 220. In one embodiment, a responsibility of the data link layer 210 is providing a reliable mechanism for exchanging Transaction Layer Packets (TLPs) between two components on a link. One side of the Data Link Layer 210 accepts TLPs assembled by the Transaction Layer 205, applies packet sequence identifier 211, i.e., an identification number or packet number, calculates and applies an error detection code, i.e., CRC 212, and submits the modified TLPs to the Physical Layer 220 for transmission across a physical link to an external device.

Physical Layer

In one embodiment, physical layer 220 includes logical sub-block 221 and electrical sub-block 222 to physically transmit a packet to an external device. Here, logical sub-block 221 is responsible for the “digital” functions of Physical Layer 220. In this regard, the logical sub-block includes a transmit section to prepare outgoing information for transmission by physical sub-block 222, and a receiver section to identify and prepare received information before passing it to the Link Layer 210.

Physical block 222 includes a transmitter and a receiver. The transmitter is supplied by logical sub-block 221 with symbols, which the transmitter serializes and transmits to an external device. The receiver is supplied with serialized symbols from an external device and transforms the received signals into a bit-stream. The bit-stream is de-serialized and supplied to logical sub-block 221. In one embodiment, an 8b/10b transmission code is employed, where ten-bit symbols are transmitted/received. Here, special symbols are used to frame a packet with frames 223. In addition, in one example, the receiver also provides a symbol clock recovered from the incoming serial stream.

As stated above, although transaction layer 205, link layer 210, and physical layer 220 are discussed in reference to a specific embodiment of a PCIe protocol stack, a layered protocol stack is not so limited. In fact, any layered protocol may be included/implemented. As an example, a port/interface that is represented as a layered protocol includes: (1) a first layer to assemble packets, i.e., a transaction layer; a second layer to sequence packets, i.e., a link layer; and a third layer to transmit the packets, i.e., a physical layer. As a specific example, a common standard interface (CSI) layered protocol is utilized.

Referring next to FIG. 4, an embodiment of a PCIe serial point-to-point fabric is illustrated. Although an embodiment of a PCIe serial point-to-point link is illustrated, a serial point-to-point link is not so limited, as it includes any transmission path for transmitting serial data. In the embodiment shown, a basic PCIe link includes two low-voltage, differentially driven signal pairs: a transmit pair 406/411 and a receive pair 412/407. Accordingly, device 405 includes transmission logic 406 to transmit data to device 410 and receiving logic 407 to receive data from device 410. In other words, two transmitting paths, i.e., paths 416 and 417, and two receiving paths, i.e., paths 418 and 419, are included in a PCIe link.

A transmission path refers to any path for transmitting data, such as a transmission line, a copper line, an optical line, a wireless communication channel, an infrared communication link, or other communication path. A connection between two devices, such as device 405 and device 410, is referred to as a link, such as link 415. A link may support one lane, each lane representing a set of differential signal pairs (one pair for transmission, one pair for reception). To scale bandwidth, a link may aggregate multiple lanes denoted by xN, where N is any supported link width, such as 1, 2, 4, 8, 12, 16, 32, 64, or wider.
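
As a rough illustration of how lane aggregation scales bandwidth, the sketch below computes per-direction throughput for an xN link, assuming a per-lane signaling rate supplied by the caller and the 128b/130b encoding efficiency referenced later in this disclosure; framing, TLP, and DLLP overhead are ignored, so the figures are upper bounds rather than exact values.

```c
#include <stdio.h>

/* Approximate usable bandwidth of an xN link in GB/s per direction.
 * raw_gt_s is the per-lane signaling rate in GT/s; 128.0/130.0 models
 * 128b/130b encoding. Protocol (TLP/DLLP/framing) overhead is ignored. */
static double link_bandwidth_gbps(int lanes, double raw_gt_s)
{
    double bits_per_sec = lanes * raw_gt_s * 1e9 * (128.0 / 130.0);
    return bits_per_sec / 8.0 / 1e9;   /* bytes per second, expressed in GB/s */
}

int main(void)
{
    int widths[] = {1, 2, 4, 8, 16};
    for (int i = 0; i < 5; i++)
        printf("x%-2d @ 8 GT/s ~ %.2f GB/s per direction\n",
               widths[i], link_bandwidth_gbps(widths[i], 8.0));
    return 0;
}
```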

A differential pair refers to two transmission paths, such as lines 416 and 417, to transmit differential signals. As an example, when line 416 toggles from a low voltage level to a high voltage level, i.e., a rising edge, line 417 drives from a high logic level to a low logic level, i.e., a falling edge. Differential signals potentially demonstrate better electrical characteristics, such as better signal integrity, i.e., cross-coupling, voltage overshoot/undershoot, ringing, etc. This allows for a better timing window, which enables faster transmission frequencies.

Physical layers of existing interconnect and communication architectures, including PCIe, can be leveraged to provide shared memory and I/O services within a system. Traditionally, cacheable memory cannot be shared between independent systems using traditional load/store (LD/ST) memory semantics. An independent system, or “node”, can be independent in the sense that it functions as a single logical entity, is controlled by a single operating system (and/or single BIOS or Virtual Machine Monitor (VMM)), and/or has an independent fault domain. A single node can include one or multiple processor devices, be implemented on a single board or multiple boards, and include local memory, including cacheable memory that can be accessed using LD/ST semantics by the devices on the same node. Within a node, shared memory can include one or more blocks of memory, such as a random access memory (RAM), that can be accessed by several different processors (e.g., central processing units (CPUs)) within a node. Shared memory can also include the local memory of the processors or other devices in the node. The multiple devices within a node having shared memory can share a single view of data within the shared memory. I/O communication involving shared memory can be very low latency and allow quick access to the memory by the multiple processors.

Traditionally, memory sharing between different nodes has not allowed memory sharing according to a load/store paradigm. For instance, in some systems, memory sharing between different nodes has been facilitated through distributed memory architectures. In traditional solutions, computational tasks operate on local data, and if data of another node is desired, the computational task (e.g., executed by another CPU node) communicates with the other node, for instance, over a communication channel utilizing a communication protocol stack, such as Ethernet, InfiniBand, or another layered protocol. In traditional multi-node systems, the processors of different nodes do not have to be aware of where data resides. Sharing data using traditional approaches, such as over a protocol stack, can have a significantly higher latency than memory sharing within a node using a load/store paradigm. Rather than directly addressing and operating on data in shared memory, one node can request data from another using an existing protocol handshake such as Ethernet (or InfiniBand), and the source node can provide the data, such that the data can be stored and operated on by the requesting node, among other examples.

In some implementations, a shared memory architecture can be provided that allows memory to be shared between independent nodes for exclusive or shared access using load/store (LD/ST) memory semantics. In one example, memory semantics (and directory information, if applicable) along with I/O semantics (for protocols such as PCIe) can be exported on either a common set of pins or a separate set of pins. In such a system, the improved shared memory architecture can enable each of a plurality of nodes in a system to maintain its own independent fault domain (and local memory), while enabling a shared memory pool for access by the nodes and low-latency message passing between nodes using memory according to LD/ST semantics. In some implementations, such a shared memory pool can be dynamically (or statically) allocated between different nodes. Accordingly, one can also configure the various nodes of a system into dynamically changing groups of nodes to work cooperatively and flexibly on various tasks making use of the shared memory infrastructure, for instance, as demand arises.

Turning to FIG. 5A, a simplified block diagram 500a is shown illustrating an example system including shared memory 505 capable of being accessed using load/store techniques by each of a plurality of independent nodes 510a-510n. For instance, a shared memory controller 515 can be provided that can accept load/store access requests of the various nodes 510a-510n on the system. Shared memory 505 can be implemented utilizing synchronous dynamic random access memory (SDRAM), dual in-line memory modules (DIMM), and other non-volatile memory (or volatile memory).

Each node may itself have one or multiple CPU sockets and may also include local memory that remains insulated from LD/ST access by other nodes in the system. The node can communicate with other devices on the system (e.g., shared memory controller 515, networking controller 520, other nodes, etc.) using one or more protocols, including PCIe, QPI, Ethernet, among other examples. In some implementations, a shared memory link (SML) protocol can be provided through which low latency LD/ST memory semantics can be supported. SML can be used, for instance, in communicating reads and writes of shared memory 505 (through shared memory controller 515) by the various nodes 510a-510n of a system.

In one example, SML can be based on a memory access protocol, such as Scalable Memory Interconnect (SMI) 3rd generation (SMI3). Other memory access protocols can be alternatively used, such as transactional memory access protocols such as fully buffered DIMM (FB-DIMM), DDR Transactional (DDR-T), among other examples. In other instances, SML can be based on native PCIe memory read/write semantics with additional directory extensions. A memory-protocol-based implementation of SML can offer bandwidth efficiency advantages due to being tailored to cache line memory accesses. While high performance inter-device communication protocols exist, such as PCIe, upper layers (e.g., transaction and link layers) of such protocols can introduce latency that degrades application of the full protocol for use in LD/ST memory transactions, including transactions involving a shared memory 505. A memory protocol, such as SMI3, can allow a potential additional advantage of offering lower latency accesses since it can bypass most of another protocol stack, such as PCIe. Accordingly, implementations of SML can utilize SMI3 or another memory protocol running on a logical and physical PHY of another protocol, such as SMI3 on PCIe.

As noted, in some implementations, a shared memory controller (SMC) 515 can be provided that includes logic for handling load/store requests of nodes 510a-510n in the system. Load/store requests can be received by the SMC 515 over links utilizing SML and connecting the nodes 510a-510n to the SMC 515. In some implementations, the SMC 515 can be implemented as a device, such as an application-specific integrated circuit (ASIC), including logic for servicing the access requests of the nodes 510a-510n for shared memory resources. In other instances, the SMC 515 (as well as shared memory 505) can reside on a device, chip, or board separate from one or more (or even all) of the nodes 510a-510n. The SMC 515 can further include logic to coordinate various nodes' transactions that involve shared memory 505. Additionally, the SMC can maintain a directory tracking access to various data resources, such as each cache line, included in shared memory 505. For instance, a data resource can be in a shared access state (e.g., capable of being accessed (e.g., loaded or read) simultaneously by multiple processing and/or I/O devices within a node), an exclusive access state (e.g., reserved exclusively, if not temporarily, by a single processing and/or I/O device within a node (e.g., for a store or write operation)), an uncached state, among other potential examples. Further, while each node may have direct access to one or more portions of shared memory 505, different addressing schemes and values may be employed by the various nodes (e.g., 510a-510n), resulting in the same shared memory data being referred to (e.g., in an instruction) by a first node according to a first address value and a second node referring to the same data by a second address value. The SMC 515 can include logic, including data structures mapping nodes' addresses to shared memory resources, to allow the SMC 515 to interpret the various access requests of the various nodes.
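
One way to picture the SMC's per-cache-line directory and per-node address interpretation is sketched below. The data structures and the smc_translate helper are hypothetical illustrations consistent with the description above, not structures taken from the disclosure.

```c
#include <stdbool.h>
#include <stdint.h>

enum line_state { LINE_UNCACHED, LINE_SHARED, LINE_EXCLUSIVE };

/* Hypothetical directory entry for one cache line of shared memory. */
struct dir_entry {
    enum line_state state;
    uint64_t        sharers;   /* bitmap of node IDs holding shared access     */
    uint16_t        owner;     /* node ID holding exclusive access, if any     */
};

/* Each node may use its own address map; the SMC translates a node-local
 * address into a global shared-memory offset before consulting the directory. */
struct node_map {
    uint64_t node_base;    /* base of the shared region in the node's view           */
    uint64_t global_base;  /* base of the node's assigned region in shared memory    */
    uint64_t length;       /* size of the region the node is permitted to access     */
};

static bool smc_translate(const struct node_map *m, uint64_t node_addr,
                          uint64_t *global_addr)
{
    if (node_addr < m->node_base || node_addr >= m->node_base + m->length)
        return false;                        /* outside the node's permitted window */
    *global_addr = m->global_base + (node_addr - m->node_base);
    return true;
}
```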

Additionally, in some cases, some portion of shared memory (e.g., certain partitions, memory blocks, records, files, etc.) may be subject to certain permissions, rules, and assignments such that only a portion of the nodes 510a-510n are allowed (e.g., by the SMC 515) to access the corresponding data. Indeed, each shared memory resource may be assigned to a respective (and in some cases different) subset of the nodes 510a-510n of the system. These assignments can be dynamic and SMC 515 can modify such rules and permissions (e.g., on-demand, dynamically, etc.) to accommodate new or changed rules, permissions, node assignments and ownership applicable to a given portion of the shared memory 505.

An example SMC 515 can further track various transactions involving nodes (e.g., 510a-510n) in the system accessing one or more shared memory resources. For instance, SMC 515 can track information for each shared memory 505 transaction, including identification of the node(s) involved in the transaction, progress of the transaction (e.g., whether it has been completed), among other transaction information. This can permit some of the transaction-oriented aspects of traditional distributed memory architectures to be applied to the improved multi-node shared memory architecture described herein. Additionally, transaction tracking (e.g., by the SMC) can be used to assist in maintaining or enforcing the distinct and independent fault domains of each respective node. For instance, the SMC can maintain the corresponding Node ID for each transaction-in-progress in its internal data structures, including in memory, and use that information to enforce access rights and maintain individual fault domains for each node. Accordingly, when one of the nodes goes down (e.g., due to a critical error, triggered recovery sequence, or other fault or event), only that node and its transactions involving the shared memory 505 are interrupted (e.g., dumped by the SMC); transactions of the remaining nodes that involve the shared memory 505 continue on independent of the fault in the other node.
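
The transaction tracking and fault-isolation behavior might be modeled as in the following sketch, which assumes a simple in-memory table keyed by node ID; the record layout and handler name are illustrative only.

```c
#include <stdint.h>

#define MAX_TXNS 256

/* Hypothetical record of an in-flight shared-memory transaction. */
struct txn {
    uint16_t node_id;     /* node that issued the load/store      */
    uint64_t global_addr; /* shared-memory address involved       */
    uint8_t  in_use;      /* 1 while the transaction is in flight */
};

static struct txn txn_table[MAX_TXNS];

/* When a node faults, drop only that node's in-flight transactions;
 * transactions of the remaining nodes continue untouched. */
static void smc_handle_node_fault(uint16_t failed_node)
{
    for (int i = 0; i < MAX_TXNS; i++) {
        if (txn_table[i].in_use && txn_table[i].node_id == failed_node)
            txn_table[i].in_use = 0;   /* dump the faulted node's transaction */
    }
}
```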

A system can include multiple nodes. Additionally, some example systems can include multiple SMCs. In some cases, a node may be able to access shared memory off a remote SMC to which it is not directly attached (i.e., the node's local SMC connects to the remote SMC through one or multiple SML link hops). The remote SMC may be on the same board or could be on a different board. In some cases, some of the nodes may be off-system (e.g., off board or off chip) but nonetheless access shared memory 505. For instance, one or more off-system nodes can connect directly to the SMC using an SML-compliant link, among other examples. Additionally, other systems that include their own SMC and shared memory can also connect with the SMC 510 to extend sharing of memory 505 to nodes included, for instance, on another board that interfaces with the other SMC connected to the SMC over an SML link. Still further, network connections can be tunneled through to further extend access to other off-board or off-chip nodes. For instance, SML can tunnel over an Ethernet connection (e.g., provided through network controller 520) communicatively coupling the example system of FIG. 5A with another system that can also include one or more other nodes and allow these nodes to also gain access to SMC 515 and thereby shared memory 505, among other examples.

As another example, as shown in the simplified block diagram 500b of FIG. 5B, an improved shared memory architecture permitting shared access by multiple independent nodes according to a LD/ST memory semantic can flexibly allow for the provision of a variety of different multi-node system designs. Various combinations of the multiple nodes can be assigned to share portions of one or more shared memory blocks provided in an example system. For instance, another example system, shown in the example of FIG. 5B, can include multiple devices 550a-550d implemented, for instance, as separate dies, boards, chips, etc., each device including one or more independent CPU nodes (e.g., 510a-510h). Each node can include its own local memory. One or more of the multiple devices 550a-550d can further include shared memory that can be accessed by two or more of the nodes 510a-510h of the system.

The system illustrated in FIG. 5B is an example provided to illustrate some of the variability that can be realized through an improved shared memory architecture, such as shown and described herein. For instance, each of a Device A 550a and Device C 550c can include a respective shared memory element (e.g., 505a, 505b). Accordingly, in some implementations, each shared memory element on a distinct device may further include a respective shared memory controller (SMC) 515a, 515b. Various combinations of nodes 510a-510h can be communicatively coupled to each SMC (e.g., 515a, 515b), allowing the nodes to access the corresponding shared memory (e.g., 505a, 505b). As an example, SMC 515a of Device A 550a can connect to nodes 510a, 510b on Device A using a direct data link supporting SML. Additionally, another node 510c on another device (e.g., Device C 550c) can also have access to the shared memory 505a by virtue of a direct, hardwired connection (supporting SML) from the node 510c (and/or its device 550c) to SMC 515a. Indirect, network-based, or other such connections can also be used to allow nodes (e.g., 510f-510h) of a remote or off-board device (e.g., Device D 550d) to utilize a conventional protocol stack to interface with SMC 515a to also have access to shared memory 505a. For instance, an SML tunnel 555 can be established over an Ethernet, InfiniBand, or other connection coupling Device A and Device D. While establishing and maintaining the tunnel can introduce some additional overhead and latency compared to SML running on other less-software-managed physical connections, the SML tunnel 555, when established, can operate as other SML channels and allow the nodes 510f-510h to interface with SMC 515a over SML and access shared memory 505a as any other node communicating with the SMC over an SML link can. For instance, reliability and ordering of the packets in the SML channels can be enforced either by the networking components in the system or end-to-end between the SMCs.

In still other examples, nodes (e.g., 510d, 510e) on a device different from that hosting a particular portion of shared memory (e.g., 505a) can connect indirectly to the corresponding SMC (e.g., SMC 515a) by connecting directly to another SMC (e.g., 515b) that is itself coupled (e.g., using an SML link) to the corresponding SMC (e.g., 515a). Linking two or more SMCs (e.g., 515a, 515b) can effectively expand the amount of shared memory available to the nodes 510a-510h on the system. For instance, by virtue of a link between SMCs 515a, 515b in the example of FIG. 5B, in some implementations, any of the nodes (e.g., 510a-510c, 510f-510h) capable of accessing shared memory 505a through SMC 515a may also potentially access sharable memory 505b by virtue of the connection between SMC 515a and SMC 515b. Likewise, in some implementations, each of the nodes directly accessing SMC 515b can also access sharable memory 505a by virtue of the connection between the SMCs 515a, 515b, among other potential examples.

As noted above, an improved shared memory architecture can include a low-latency link protocol (i.e., SML) based on a memory access protocol, such as SMI3, and provided to facilitate load/store requests involving the shared memory. Whereas traditional SMI3 and other memory access protocols may be configured for use in memory sharing within a single node, SML can extend memory access semantics to multiple nodes to allow memory sharing between the multiple nodes. Further, SML can potentially be utilized on any physical communication link. SML can utilize a memory access protocol supporting LD/ST memory semantics that is overlaid on a physical layer (and corresponding physical layer logic) adapted to interconnect distinct devices (and nodes). Additionally, physical layer logic of SML can provide for no packet dropping and error retry functionality, among other features.

In some implementations, SML can be implemented by overlaying SMI3 on a PCIe PHY. An SML link layer can be provided (e.g., in lieu of a traditional PCIe link layer) to forego flow control and other features and facilitate lower latency memory access such as would be characteristic in traditional CPU memory access architectures. In one example, SML link layer logic can multiplex between shared memory transactions and other transactions. For instance, SML link layer logic can multiplex between SMI3 and PCIe transactions. For instance, SMI3 (or another memory protocol) can overlay on top of PCIe (or another interconnect protocol) so that the link can dynamically switch between SMI3 and PCIe transactions. This can allow traditional PCIe traffic to effectively coexist on the same link as SML traffic in some instances.

Turning to FIG. 6, a representation 600 is shown illustrating a first implementation of SML. For instance, SML can be implemented by overlaying SMI3 on a PCIe PHY. The physical layer can use standard PCIe 128b/130b encoding for all physical layer activities including link training as well as PCIe data blocks. SML can provide for traffic on the lanes (e.g., Lane 0-Lane 7) of the link to be multiplexed between PCIe packets and SMI3 flits. For example, in the implementation illustrated in FIG. 6, the sync header of the PCIe 128b/130b encoding can be modified and used to indicate that SMI3 flits are to be sent on the lanes of the link rather than PCIe packets. In traditional PCIe 128b/130b encoding, valid sync headers (e.g., 610) can include the sending of either a 10b pattern on all lanes of the link (to indicate that the type of payload of the block is to be PCIe Data Block) or a 01b pattern on all lanes of the link (to indicate that the type of payload of the block is to be PCIe Ordered Set Block). In an example of SML, an alternate sync header can be defined to differentiate SMI3 flit traffic from PCIe data blocks and ordered sets. In one example, illustrated in FIG. 6, the PCIe 128b/130b sync header (e.g., 605a, 605b) can be encoded with alternating 01b, 10b patterns on odd/even lanes to identify that SMI3 flits are to be sent. In another alternative implementation, the 128b/130b sync header encoding for SMI3 traffic can be defined by alternating 10b, 01b patterns on odd/even lanes, among other example encodings. In some cases, SMI3 flits can be transmitted immediately following the SMI3 sync header on a per-byte basis, with the transition between PCIe and SMI3 protocols taking place at the block boundary.
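
The alternating sync header encoding in the FIG. 6 example can be sketched as follows. The code assumes an x8 link and assigns 01b to odd lanes and 10b to even lanes, which is one reading of the example; the disclosure also mentions the opposite assignment as an alternative, and the function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_LANES 8

/* 2-bit 128b/130b sync header values per lane. */
#define SYNC_DATA        0x2   /* 10b: PCIe data block        */
#define SYNC_ORDERED_SET 0x1   /* 01b: PCIe ordered set block */

/* Encode an SML sync header: alternating patterns across lanes indicate that
 * SMI3 flits (rather than PCIe packets) follow in this block. */
static void encode_smi3_sync(uint8_t hdr[NUM_LANES])
{
    for (int lane = 0; lane < NUM_LANES; lane++)
        hdr[lane] = (lane % 2 == 0) ? SYNC_DATA : SYNC_ORDERED_SET;
}

/* Detect the SMI3 sync header; a uniform 10b or 01b across all lanes is
 * ordinary PCIe (data block or ordered set block, respectively). */
static bool is_smi3_sync(const uint8_t hdr[NUM_LANES])
{
    for (int lane = 0; lane < NUM_LANES; lane++) {
        uint8_t expected = (lane % 2 == 0) ? SYNC_DATA : SYNC_ORDERED_SET;
        if (hdr[lane] != expected)
            return false;
    }
    return true;
}
```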

In some implementations, such as that illustrated in the example of FIG. 6, the transition between the protocols can be defined to take place at the block boundary irrespective of whether it corresponds to an SMI3 flit or PCIe packet boundary. For instance, a block can be defined to include a predefined amount of data (e.g., 16 symbols, 128 bytes, etc.). In such implementations, when the block boundary does not correspond to an SMI3 flit or PCIe packet boundary, the transmission of an entire SMI3 flit may be interrupted. An interrupted SMI3 flit can be resumed in the next SMI3 block indicated by the sending of another sync header encoded for SMI3.

Turning to FIG. 7A, a representation 700 is shown illustrating another example implementation of SML. In the example of FIG. 7A, rather than using a specialized sync header encoding to signal transitions between memory access and interconnect protocol traffic, physical layer framing tokens can be used. A framing token (or “token”) can be a physical layer data encapsulation that specifies or implies the number of symbols to be included in a stream of data associated with the token. Consequently, the framing token can identify that a stream is beginning as well as imply where it will end, and can therefore be used to also identify the location of the next framing token. A framing token of a data stream can be located in the first symbol (Symbol 0) of the first lane (e.g., Lane 0) of the first data block of the data stream. In the example of PCIe, five framing tokens can be defined, including the start of TLP traffic (STP) token, the end of data stream (EDS) token, the end bad (EDB) token, the start of DLLP (SDP) token, and the logical idle (IDL) token.

In the example of FIG. 7A, SML can be implemented by overlaying SMI3 or another data access protocol on PCIe, and the standard PCIe STP token can be modified to define a new STP token that identifies that SMI3 (instead of TLP traffic) is to commence on the lanes of the link. In some examples, values of reserve bits of the standard PCIe STP token can be modified to define the SMI3 STP token in SML. Further, as shown in FIG. 7B, an STP token 705 can include several fields, including a length field 710 that identifies the length of the SMI3 payload (in terms of the number of flits) that is to follow. In some implementations, one or more standard payload lengths can be defined for TLP data. SMI3 data can, in some implementations, be defined to include a fixed number of flits, or in other cases may have variable numbers of flits, in which case the length field for the number of SMI3 flits becomes a field that can be disregarded. Further, the length field for an SMI3 STP can be defined as a length other than one of the defined TLP payload lengths. Accordingly, an SMI3 STP can be identified based on a non-TLP length value being present in the STP length field, as one example. For example, in one implementation, the upper 3 bits of the 11-bit STP length field can be set to 111b to indicate the SMI3 packet (e.g., based on the assumption that no specification-compliant PCIe TLP can be long enough to have a length where the upper 3 bits of the length field would result in 1's). Other implementations can alter or encode other fields of the STP token to differentiate a PCIe STP token identifying a traditional PCIe TLP data payload from an SMI3 STP token identifying that SMI3 data is encapsulated in TLP data.
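
The length-field technique for distinguishing an SMI3 STP token from a PCIe STP token can be illustrated with the sketch below. The assumption that the 11-bit length sits in the low bits of the token word, and the helper names, are for illustration only; the defining idea is simply that the upper three bits of the length are set to 111b, a value no compliant TLP length would produce.

```c
#include <stdbool.h>
#include <stdint.h>

#define STP_LEN_BITS   11
#define STP_LEN_MASK   ((1u << STP_LEN_BITS) - 1)
#define SMI3_LEN_MARK  (0x7u << (STP_LEN_BITS - 3))   /* upper 3 bits = 111b */

/* Hypothetical helper: extract the 11-bit length field from a received STP token. */
static uint16_t stp_length_field(uint32_t stp_token)
{
    return (uint16_t)(stp_token & STP_LEN_MASK);   /* assumes length in the low bits */
}

/* An SMI3 STP is identified by a length value no compliant PCIe TLP can produce. */
static bool is_smi3_stp(uint32_t stp_token)
{
    uint16_t len = stp_length_field(stp_token);
    return (len & SMI3_LEN_MARK) == SMI3_LEN_MARK;
}
```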

Returning to the example of FIG. 7A, sync header data can follow the encoding specified for traditional PCIe 128b/130b encoding. For instance, at 715a-c, sync headers with value 10b are received, indicating that data blocks are forthcoming. When a PCIe STP (e.g., 720) is received, a PCIe TLP payload is expected and the data stream is processed accordingly. Consistent with the payload length identified in the PCIe STP 720, the PCIe TLP payload can utilize the full payload length allocated. Another STP token can be received essentially at any time within a data block following the end of the TLP payload. For instance, at 725, an SMI3 STP can be received signaling a transition from PCIe TLP data to SMI3 flit data. The SMI3 STP can be sent, for instance, as soon as an end of the PCIe packet data is identified.

Continuing with the example of FIG. 7A, as with PCIe TLP data, the SMI3 STP 725 can define a length of the SMI3 flit payload that is to follow. For instance, the payload length of the SMI3 data can correspond to the number of SMI3 flits in terms of DWs to follow. A window (e.g., ending at Symbol 15 of Lane 3) corresponding to the payload length can thereby be defined on the lanes, in which only SMI3 data is to be sent during the window. When the window concludes, other data can be sent, such as another PCIe STP to recommence sending of TLP data or other data, such as ordered set data. For instance, as shown in the example of FIG. 7A, an EDS token is sent following the end of the SMI3 data window defined by SMI3 STP token 725. The EDS token can signal the end of the data stream and imply that an ordered set block is to follow, as is the case in the example of FIG. 7A. A sync header 740 is sent that is encoded 01b to indicate that an ordered set block is to be sent. In this case a PCIe SKP ordered set is sent. Such ordered sets can be sent periodically or according to set intervals or windows so that various PHY-level tasks and coordination can be performed, including initializing bit alignment, initializing symbol alignment, exchanging PHY parameters, compensating for different bit rates for two communicating ports, among other examples. In some cases, a mandated ordered set can be sent to interrupt a defined window or data block specified for SMI3 flit data by a corresponding SMI3 STP token.

While not shown explicitly in the example of FIG. 7A, an STP token can also be used to transition from SMI3 flit data on the link to PCIe TLP data. For instance, following the end of a defined SMI3 window, a PCIe STP token (e.g., similar to token 720) can be sent to indicate that the next window is for the sending of a specified amount of PCIe TLP data.

Memory access flits (e.g., SMI3 flits) may vary in size in some embodiments, making it difficult to predict, a priori, how much data to reserve in the corresponding STP token (e.g., SMI3 STP token) for the memory access payload. As an example, as shown in FIG. 7A, SMI3 STP 725 can have a length field indicating that 244 bytes of SMI3 data are to be expected following the SMI3 STP 725. However, in this example, only ten flits (e.g., SMI3 Flits 0-9) are ready to be sent during the window, and these ten SMI3 flits only utilize 240 of the 244 bytes. Accordingly, four (4) bytes of empty bandwidth are left, and these are filled with IDL tokens. This can be particularly suboptimal when PCIe TLP data is queued and waiting for the SMI3 window to close. In other cases, the window provided for the sending of SMI3 flits may be insufficient to send the amount of SMI3 data ready for the lane. Arbitration techniques can be employed to determine how to arbitrate between SMI3 and PCIe TLP data coexisting on the link. Further, in some implementations, the length of the SMI3 windows can be dynamically modified to assist in more efficient use of the link. For instance, arbitration or other logic can monitor how well the defined SMI3 windows are utilized to determine whether the defined window length can be better optimized to the amount of SMI3 (and competing PCIe TLP traffic) expected for the lane. Accordingly, in such implementations, the length field values of SMI3 STP tokens can be dynamically adjusted (e.g., between different values) depending on the amount of link bandwidth that SMI3 flit data should be allocated (e.g., relative to other PCIe data, including TLP, DLLP, and ordered set data), among other examples.
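
The padding arithmetic in the example above (a 244-byte window, ten flits consuming 240 bytes, four bytes of IDL fill) can be captured in a small sketch. The 24-byte flit size is inferred from the 240/244 example, and the window-adaptation rule shown is a hypothetical policy, not one specified by the disclosure.

```c
#include <stdint.h>
#include <stdio.h>

#define SMI3_FLIT_BYTES 24u   /* assumed flit size consistent with the 240/244 example */

/* Bytes of IDL padding needed when the advertised window exceeds the flits ready. */
static uint32_t idl_padding(uint32_t window_bytes, uint32_t flits_ready)
{
    uint32_t used = flits_ready * SMI3_FLIT_BYTES;
    return (used >= window_bytes) ? 0 : window_bytes - used;
}

/* A simple (hypothetical) adaptation rule: shrink the next window toward what
 * was actually used, so queued PCIe TLP traffic is not held off unnecessarily. */
static uint32_t next_window_bytes(uint32_t window_bytes, uint32_t flits_ready)
{
    uint32_t used = flits_ready * SMI3_FLIT_BYTES;
    return (used + window_bytes) / 2;   /* move halfway toward observed usage */
}

int main(void)
{
    printf("padding = %u bytes\n", (unsigned)idl_padding(244, 10));          /* prints 4 */
    printf("next window = %u bytes\n", (unsigned)next_window_bytes(244, 10));
    return 0;
}
```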

Turning to FIG. 8, a representation 800 of another example implementation of SML is illustrated. In this alternative embodiment, SML can provide for interleaving SMI3 and PCIe protocols through a modified PCIe framing token. As noted above, an EDS token can be used in PCIe to indicate an end of a data stream and indicate that the next block will be an ordered set block. In the example of FIG. 8, SML can define an SMI3 EDS token (e.g., 805) that indicates the end of a TLP data stream and the transition to SMI3 flit transmissions. An SMI3 EDS (e.g., 805) can be defined by encoding a portion of the reserved bits of the traditional EDS token to indicate that SMI3 data is to follow, rather than PCIe ordered sets or other data that is to follow a PCIe EDS. Unlike the traditional EDS token, the SMI3 EDS can be sent essentially anywhere within a PCIe data block. This can permit additional flexibility in sending SMI3 data and accommodating corresponding low-latency shared memory transactions. For instance, a transition from PCIe to SMI3 can be accomplished with a single double word (DW) of overhead. Further, as with traditional EDS tokens, an example SMI3 EDS may not specify a length associated with the SMI3 data that is to follow the token. Following an SMI3 EDS, PCIe TLP data can conclude and SMI3 flits proceed on the link. SMI3 traffic can proceed until SMI3 logic passes control back to PCIe logic. In some implementations, the sending of an SMI3 EDS causes control to be passed from PCIe logic to SMI3 logic provided, for instance, on devices connected on the link.

In one example, SMI3 (or another protocol) can define its own link control signaling for use in performing link layer control. For example, in one implementation, SML can define a specialized version of an SMI3 link layer control (LLCTRL) flit (e.g., 810) that indicates a transition from SMI3 back to PCIe protocol. As with an SMI3 EDS, the defined LLCTRL flit (e.g., 810) can cause control to be passed from SMI3 logic back to PCIe logic. In some cases, as shown in the example of FIG. 8, the defined LLCTRL flit (e.g., 810) can be padded with a predefined number of LLCTRL idle (LLCTRL-IDLE) flits (e.g., 815) before completing the transition to PCIe. For instance, the number of LLCTRL-IDLE flits 815 to be sent to pad the SMI3 LLCTRL flit 810 can depend on the latency to decode the defined SMI3 LLCTRL flit 810 signaling the transition. After completing the transition back to PCIe, an STP packet can be sent and TLP packet data can recommence on the link under control of PCIe.
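
The token-driven hand-off between PCIe and SMI3 link layer logic described for FIG. 8 might be modeled as a small state machine, as in the sketch below; the token identifiers and the idle-flit threshold are placeholders, with the threshold standing in for the decode-latency-dependent padding described above.

```c
#include <stdint.h>

enum link_owner { OWNER_PCIE, OWNER_SMI3 };

enum token {
    TOK_PCIE_STP,     /* start of PCIe TLP stream                           */
    TOK_SMI3_EDS,     /* modified EDS: end TLP stream, switch to SMI3       */
    TOK_SMI3_LLCTRL,  /* SMI3 link layer control flit: begin switch to PCIe */
    TOK_LLCTRL_IDLE,  /* idle flits padding the LLCTRL while it is decoded  */
    TOK_OTHER
};

struct link_ctl {
    enum link_owner owner;
    uint32_t idle_flits_seen;   /* LLCTRL-IDLE flits observed after LLCTRL */
    uint32_t idle_flits_needed; /* padding required before PCIe resumes    */
};

static void on_token(struct link_ctl *lc, enum token t)
{
    switch (t) {
    case TOK_SMI3_EDS:              /* SMI3 EDS may appear anywhere in a data block */
        lc->owner = OWNER_SMI3;
        break;
    case TOK_SMI3_LLCTRL:           /* begin transition back to PCIe                */
        lc->idle_flits_seen = 0;
        break;
    case TOK_LLCTRL_IDLE:
        if (lc->owner == OWNER_SMI3 &&
            ++lc->idle_flits_seen >= lc->idle_flits_needed)
            lc->owner = OWNER_PCIE; /* transition completes after the padding       */
        break;
    default:
        break;
    }
}
```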

It should be appreciated that the implementations described herein are provided as examples to illustrate certain principles and features disclosed in the Specification. It should be appreciated that alternative configurations, protocols, and architectures (other than those specifically discussed in the examples) can utilize and apply such principles and features. As an example of one alternative, PCIe memory read/write can be used (e.g., instead of the SMI3 protocol), enhanced with directory information. The directory information can be implemented through reserve bits of the PCIe packet. In another example, CPU nodes can utilize a cache controller (e.g., as an alternative to a shared memory controller) to send memory read/write transactions on a PCIe link, for instance, based on a remote address range check, among other potential examples and alternatives.

Turning to FIGS. 9A-9D, flowcharts 900a-d are shown illustrating example techniques for communicating using an MCPL. For instance, in FIG. 9A, a load/store memory access message can be received 905 from a first node, the message requesting particular data of a shared memory. Access to the particular data can be provided 910 to the first node. A second load/store memory access message can be received 915 from a second independent node. The second message can request access to the same particular data of the shared memory, and access to the particular data can be provided 920 to the second node. Data in shared memory can thus be shared and accessed by multiple different independent nodes.
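
A minimal model of the FIG. 9A flow, with two independent nodes issuing load/store messages against the same shared-memory location through the SMC, might look like the following. The message format, handler name, and fixed memory size are illustrative assumptions; address translation and directory checks from the earlier discussion are omitted for brevity.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define SHARED_MEM_BYTES 4096

static uint8_t shared_mem[SHARED_MEM_BYTES];   /* the pooled shared memory */

enum op { OP_LOAD, OP_STORE };

/* Hypothetical load/store message arriving over an SML link. */
struct ld_st_msg {
    uint16_t node_id;
    enum op  op;
    uint64_t addr;     /* already translated to a global shared-memory offset */
    uint64_t value;    /* store data, or load result on return                */
};

/* SMC request handler: services a load or store on behalf of any node. */
static void smc_service(struct ld_st_msg *msg)
{
    if (msg->addr + sizeof(uint64_t) > SHARED_MEM_BYTES)
        return;                                   /* out of range: drop (sketch only) */
    if (msg->op == OP_STORE)
        memcpy(&shared_mem[msg->addr], &msg->value, sizeof(uint64_t));
    else
        memcpy(&msg->value, &shared_mem[msg->addr], sizeof(uint64_t));
}

int main(void)
{
    struct ld_st_msg store = { .node_id = 1, .op = OP_STORE, .addr = 64, .value = 42 };
    struct ld_st_msg load  = { .node_id = 2, .op = OP_LOAD,  .addr = 64 };

    smc_service(&store);   /* first node writes the shared location    */
    smc_service(&load);    /* second, independent node reads it back   */
    printf("node %u loaded %llu\n", load.node_id, (unsigned long long)load.value);
    return 0;
}
```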

In the example of FIG. 9B, a first sync header (such as a PCIe sync header) can be received 925 with a first encoding. The encoding can indicate a transition from an interconnect protocol to a memory access protocol, and the transition can be identified 930 from the first sync header. Data of the memory access protocol can be received following the first sync header, and the data can be processed 935 (e.g., consistent with the memory access protocol). In some examples, the memory access protocol data can include transactions involving shared memory shared by multiple independent nodes. A second sync header can be received 940 that includes a second, different encoding that indicates a transition back to the interconnect protocol. The transition from the memory access protocol back to the interconnect protocol can be identified 945 from the second sync header.
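
The FIG. 9B flow can be summarized as a per-block receive step that switches protocol handling based on the sync header encoding, as in the hypothetical sketch below; the handler functions are placeholders for the respective protocols' processing logic.

```c
#include <stdint.h>

enum rx_mode { RX_INTERCONNECT, RX_MEMORY_ACCESS };

/* Placeholder handlers standing in for the two protocols' link layer logic. */
static void handle_memory_access_flits(const uint8_t block[16]) { (void)block; }
static void handle_interconnect_symbols(const uint8_t block[16]) { (void)block; }

/* Hypothetical per-block receive step: the sync header encoding received
 * ahead of each block selects which protocol logic consumes the payload. */
static enum rx_mode process_block(int sync_indicates_memory_access,
                                  const uint8_t block[16])
{
    enum rx_mode mode = sync_indicates_memory_access ? RX_MEMORY_ACCESS
                                                     : RX_INTERCONNECT;
    if (mode == RX_MEMORY_ACCESS)
        handle_memory_access_flits(block);    /* e.g., SMI3 flit processing   */
    else
        handle_interconnect_symbols(block);   /* e.g., PCIe packet processing */
    return mode;
}
```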

Turning to FIG. 9C, in some instances, a first start of data token (e.g., a PCIe STP token) can be received 950 that includes one or more values encoded to identify a transition from an interconnect protocol to a memory access protocol. Data of the memory access protocol can arrive following the first start of data token and can be identified 955. The data of the memory access protocol can be processed 960. A length field can be included in the first start of data token indicating when data is to transition back to interconnect protocol data. Indeed, in some implementations, the length field of a start of data token can be encoded to indicate a length that corresponds to data of the memory access protocol. Further, a second, different start of data framing token can be defined that is to be interpreted to correspond to the arrival of data of the interconnect protocol. Each of the first and second start of data framing tokens can be defined according to the interconnect protocol (e.g., PCIe), among other examples.

In the example of FIG. 9D, an end of stream token (e.g., a specialized PCIe EDS token) can be received 965 that is encoded to indicate a transition to memory access protocol data. The received end of stream token can cause a transition 970 from link layer logic for processing interconnect protocol data to link layer logic for processing memory access protocol data. Data of the memory access protocol can be received 975 and processed using the link layer logic of the memory access protocol. Link layer control data of the memory access protocol can be received 980 (e.g., at the end of the data of the memory access protocol) to indicate a transition to data of the interconnect protocol. Receiving 980 the link layer control data can cause a transition 985 from the link layer logic of the memory access protocol to the link layer logic of the interconnect protocol. Data of the interconnect protocol can be received following the link layer control data and can be processed by the link layer logic of the interconnect protocol following the transition 985, among other examples.

It should be noted that while many of the above principles and examples are described within the context of PCIe and particular revisions of the PCIe specification, the principles, solutions, and features described herein can be equally applicable to other protocols and systems. For instance, analogous lane errors can be detected in other links using other protocols based on analogous symbols, data streams, and tokens, as well as rules specified for the use, placement, and formatting of such structures within data transmitted over these other links. Further, alternative mechanisms and structures (e.g., besides a PCIe LES register or SKP OS) can be used to provide lane error detection and reporting functionality within a system. Moreover, combinations of the above solutions can be applied within systems, including combinations of logical and physical enhancements to a link and its corresponding logic as described herein, among other examples.

Note that the apparatuses, methods, and systems described above may be implemented in any electronic device or system as aforementioned. As specific illustrations, the figures below provide exemplary systems for utilizing the invention as described herein. As the systems below are described in more detail, a number of different interconnects are disclosed, described, and revisited from the discussion above. And as is readily apparent, the advances described above may be applied to any of those interconnects, fabrics, or architectures.

Referring to FIG. 10, an embodiment of a block diagram for a computingsystem including a multicore processor is depicted. Processor 1000includes any processor or processing device, such as a microprocessor,an embedded processor, a digital signal processor (DSP), a networkprocessor, a handheld processor, an application processor, aco-processor, a system on a chip (SOC), or other device to execute code.Processor 1000, in one embodiment, includes at least two cores—core 1001and 1002, which may include asymmetric cores or symmetric cores (theillustrated embodiment). However, processor 1000 may include any numberof processing elements that may be symmetric or asymmetric.

In one embodiment, a processing element refers to hardware or logic tosupport a software thread. Examples of hardware processing elementsinclude: a thread unit, a thread slot, a thread, a process unit, acontext, a context unit, a logical processor, a hardware thread, a core,and/or any other element, which is capable of holding a state for aprocessor, such as an execution state or architectural state. In otherwords, a processing element, in one embodiment, refers to any hardwarecapable of being independently associated with code, such as a softwarethread, operating system, application, or other code. A physicalprocessor (or processor socket) typically refers to an integratedcircuit, which potentially includes any number of other processingelements, such as cores or hardware threads.

A core often refers to logic located on an integrated circuit capable ofmaintaining an independent architectural state, wherein eachindependently maintained architectural state is associated with at leastsome dedicated execution resources. In contrast to cores, a hardwarethread typically refers to any logic located on an integrated circuitcapable of maintaining an independent architectural state, wherein theindependently maintained architectural states share access to executionresources. As can be seen, when certain resources are shared and othersare dedicated to an architectural state, the line between thenomenclature of a hardware thread and core overlaps. Yet often, a coreand a hardware thread are viewed by an operating system as individuallogical processors, where the operating system is able to individuallyschedule operations on each logical processor.

Physical processor 1000, as illustrated in FIG. 10, includes twocores—core 1001 and 1002. Here, core 1001 and 1002 are consideredsymmetric cores, i.e. cores with the same configurations, functionalunits, and/or logic. In another embodiment, core 1001 includes anout-of-order processor core, while core 1002 includes an in-orderprocessor core.

However, cores 1001 and 1002 may be individually selected from any type of core, such as a native core, a software managed core, a core adapted to execute a native Instruction Set Architecture (ISA), a core adapted to execute a translated Instruction Set Architecture (ISA), a co-designed core, or other known core. In a heterogeneous core environment (i.e. asymmetric cores), some form of translation, such as binary translation, may be utilized to schedule or execute code on one or both cores. Yet to further the discussion, the functional units illustrated in core 1001 are described in further detail below, as the units in core 1002 operate in a similar manner in the depicted embodiment.

As depicted, core 1001 includes two hardware threads 1001 a and 1001 b,which may also be referred to as hardware thread slots 1001 a and 1001b. Therefore, software entities, such as an operating system, in oneembodiment potentially view processor 1000 as four separate processors,i.e., four logical processors or processing elements capable ofexecuting four software threads concurrently. As alluded to above, afirst thread is associated with architecture state registers 1001 a, asecond thread is associated with architecture state registers 1001 b, athird thread may be associated with architecture state registers 1002 a,and a fourth thread may be associated with architecture state registers1002 b. Here, each of the architecture state registers (1001 a, 1001 b,1002 a, and 1002 b) may be referred to as processing elements, threadslots, or thread units, as described above. As illustrated, architecturestate registers 1001 a are replicated in architecture state registers1001 b, so individual architecture states/contexts are capable of beingstored for logical processor 1001 a and logical processor 1001 b. Incore 1001, other smaller resources, such as instruction pointers andrenaming logic in allocator and renamer block 1030 may also bereplicated for threads 1001 a and 1001 b. Some resources, such asre-order buffers in reorder/retirement unit 1035, ILTB 1020, load/storebuffers, and queues may be shared through partitioning. Other resources,such as general purpose internal registers, page-table base register(s),low-level data-cache and data-TLB 1015, execution unit(s) 1040, andportions of out-of-order unit 1035 are potentially fully shared.

Processor 1000 often includes other resources, which may be fullyshared, shared through partitioning, or dedicated by/to processingelements. In FIG. 10, an embodiment of a purely exemplary processor withillustrative logical units/resources of a processor is illustrated. Notethat a processor may include, or omit, any of these functional units, aswell as include any other known functional units, logic, or firmware notdepicted. As illustrated, core 1001 includes a simplified,representative out-of-order (OOO) processor core.

But an in-order processor may be utilized in different embodiments. TheOOO core includes a branch target buffer 1020 to predict branches to beexecuted/taken and an instruction-translation buffer (I-TLB) 1020 tostore address translation entries for instructions.

Core 1001 further includes decode module 1025 coupled to fetch unit 1020to decode fetched elements. Fetch logic, in one embodiment, includesindividual sequencers associated with thread slots 1001 a, 1001 b,respectively. Usually core 1001 is associated with a first ISA, whichdefines/specifies instructions executable on processor 1000. Oftenmachine code instructions that are part of the first ISA include aportion of the instruction (referred to as an opcode), whichreferences/specifies an instruction or operation to be performed. Decodelogic 1025 includes circuitry that recognizes these instructions fromtheir opcodes and passes the decoded instructions on in the pipeline forprocessing as defined by the first ISA. For example, as discussed inmore detail below decoders 1025, in one embodiment, include logicdesigned or adapted to recognize specific instructions, such astransactional instruction. As a result of the recognition by decoders1025, the architecture or core 1001 takes specific, predefined actionsto perform tasks associated with the appropriate instruction. It isimportant to note that any of the tasks, blocks, operations, and methodsdescribed herein may be performed in response to a single or multipleinstructions; some of which may be new or old instructions. Notedecoders 1026, in one embodiment, recognize the same ISA (or a subsetthereof). Alternatively, in a heterogeneous core environment, decoders1026 recognize a second ISA (either a subset of the first ISA or adistinct ISA).

In one example, allocator and renamer block 1030 includes an allocatorto reserve resources, such as register files to store instructionprocessing results. However, threads 1001 a and 1001 b are potentiallycapable of out-of-order execution, where allocator and renamer block1030 also reserves other resources, such as reorder buffers to trackinstruction results. Unit 1030 may also include a register renamer torename program/instruction reference registers to other registersinternal to processor 1000. Reorder/retirement unit 1035 includescomponents, such as the reorder buffers mentioned above, load buffers,and store buffers, to support out-of-order execution and later in-orderretirement of instructions executed out-of-order.

Scheduler and execution unit(s) block 1040, in one embodiment, includes a scheduler unit to schedule instructions/operations on execution units. For example, a floating point instruction is scheduled on a port of an execution unit that has an available floating point execution unit. Register files associated with the execution units are also included to store instruction processing results. Exemplary execution units include a floating point execution unit, an integer execution unit, a jump execution unit, a load execution unit, a store execution unit, and other known execution units.

Lower level data cache and data translation buffer (D-TLB) 1050 arecoupled to execution unit(s) 1040. The data cache is to store recentlyused/operated on elements, such as data operands, which are potentiallyheld in memory coherency states. The D-TLB is to store recentvirtual/linear to physical address translations. As a specific example,a processor may include a page table structure to break physical memoryinto a plurality of virtual pages.

Here, cores 1001 and 1002 share access to higher-level or further-out cache, such as a second level cache associated with on-chip interface 1010. Note that higher-level or further-out refers to cache levels increasing or getting further away from the execution unit(s). In one embodiment, higher-level cache is a last-level data cache—last cache in the memory hierarchy on processor 1000—such as a second or third level data cache. However, higher level cache is not so limited, as it may be associated with or include an instruction cache. A trace cache—a type of instruction cache—instead may be coupled after decoder 1025 to store recently decoded traces. Here, an instruction potentially refers to a macro-instruction (i.e. a general instruction recognized by the decoders), which may decode into a number of micro-instructions (micro-operations).

In the depicted configuration, processor 1000 also includes on-chipinterface module 1010. Historically, a memory controller, which isdescribed in more detail below, has been included in a computing systemexternal to processor 1000. In this scenario, on-chip interface 1010 isto communicate with devices external to processor 1000, such as systemmemory 1075, a chipset (often including a memory controller hub toconnect to memory 1075 and an I/O controller hub to connect peripheraldevices), a memory controller hub, a northbridge, or other integratedcircuit. And in this scenario, bus 1005 may include any knowninterconnect, such as multi-drop bus, a point-to-point interconnect, aserial interconnect, a parallel bus, a coherent (e.g. cache coherent)bus, a layered protocol architecture, a differential bus, and a GTL bus.

Memory 1075 may be dedicated to processor 1000 or shared with otherdevices in a system. Common examples of types of memory 1075 includeDRAM, SRAM, non-volatile memory (NV memory), and other known storagedevices. Note that device 1080 may include a graphic accelerator,processor or card coupled to a memory controller hub, data storagecoupled to an I/O controller hub, a wireless transceiver, a flashdevice, an audio controller, a network controller, or other knowndevice.

Recently however, as more logic and devices are being integrated on a single die, such as SOC, each of these devices may be incorporated on processor 1000. For example, in one embodiment, a memory controller hub is on the same package and/or die with processor 1000. Here, a portion of the core (an on-core portion) 1010 includes one or more controller(s) for interfacing with other devices such as memory 1075 or a graphics device 1080. The configuration including an interconnect and controllers for interfacing with such devices is often referred to as an on-core (or un-core) configuration. As an example, on-chip interface 1010 includes a ring interconnect for on-chip communication and a high-speed serial point-to-point link 1005 for off-chip communication. Yet, in the SOC environment, even more devices, such as the network interface, co-processors, memory 1075, graphics processor 1080, and any other known computer devices/interfaces may be integrated on a single die or integrated circuit to provide small form factor with high functionality and low power consumption.

In one embodiment, processor 1000 is capable of executing a compiler, optimization, and/or translator code 1077 to compile, translate, and/or optimize application code 1076 to support the apparatus and methods described herein or to interface therewith. A compiler often includes a program or set of programs to translate source text/code into target text/code. Usually, compilation of program/application code with a compiler is done in multiple phases and passes to transform high-level programming language code into low-level machine or assembly language code. Yet, single pass compilers may still be utilized for simple compilation. A compiler may utilize any known compilation techniques and perform any known compiler operations, such as lexical analysis, preprocessing, parsing, semantic analysis, code generation, code transformation, and code optimization.

Larger compilers often include multiple phases, but most often these phases are included within two general phases: (1) a front-end, i.e. generally where syntactic processing, semantic processing, and some transformation/optimization may take place, and (2) a back-end, i.e. generally where analysis, transformations, optimizations, and code generation takes place. Some compilers refer to a middle end, which illustrates the blurring of delineation between a front-end and back-end of a compiler. As a result, reference to insertion, association, generation, or other operation of a compiler may take place in any of the aforementioned phases or passes, as well as any other known phases or passes of a compiler. As an illustrative example, a compiler potentially inserts operations, calls, functions, etc. in one or more phases of compilation, such as insertion of calls/operations in a front-end phase of compilation and then transformation of the calls/operations into lower-level code during a transformation phase. Note that during dynamic compilation, compiler code or dynamic optimization code may insert such operations/calls, as well as optimize the code for execution during runtime. As a specific illustrative example, binary code (already compiled code) may be dynamically optimized during runtime. Here, the program code may include the dynamic optimization code, the binary code, or a combination thereof.

Similar to a compiler, a translator, such as a binary translator, translates code either statically or dynamically to optimize and/or translate code. Therefore, reference to execution of code, application code, program code, or other software environment may refer to: (1) execution of a compiler program(s), optimization code optimizer, or translator either dynamically or statically, to compile program code, to maintain software structures, to perform other operations, to optimize code, or to translate code; (2) execution of main program code including operations/calls, such as application code that has been optimized/compiled; (3) execution of other program code, such as libraries, associated with the main program code to maintain software structures, to perform other software related operations, or to optimize code; or (4) a combination thereof.

Referring now to FIG. 11, shown is a block diagram of an embodiment of amulticore processor. As shown in the embodiment of FIG. 11, processor1100 includes multiple domains. Specifically, a core domain 1130includes a plurality of cores 1130A-1130N, a graphics domain 1160includes one or more graphics engines having a media engine 1165, and asystem agent domain 1110.

In various embodiments, system agent domain 1110 handles power control events and power management, such that individual units of domains 1130 and 1160 (e.g. cores and/or graphics engines) are independently controllable to dynamically operate at an appropriate power mode/level (e.g. active, turbo, sleep, hibernate, deep sleep, or other Advanced Configuration and Power Interface (ACPI)-like state) in light of the activity (or inactivity) occurring in the given unit. Each of domains 1130 and 1160 may operate at different voltage and/or power, and furthermore the individual units within the domains each potentially operate at an independent frequency and voltage. Note that while only shown with three domains, understand the scope of the present invention is not limited in this regard and additional domains may be present in other embodiments.

As shown, each core 1130 further includes low level caches in additionto various execution units and additional processing elements. Here, thevarious cores are coupled to each other and to a shared cache memorythat is formed of a plurality of units or slices of a last level cache(LLC) 1140A-1140N; these LLCs often include storage and cache controllerfunctionality and are shared amongst the cores, as well as potentiallyamong the graphics engine too.

As seen, a ring interconnect 1150 couples the cores together, andprovides interconnection between the core domain 1130, graphics domain1160 and system agent circuitry 1110, via a plurality of ring stops1152A-1152N, each at a coupling between a core and LLC slice. As seen inFIG. 11, interconnect 1150 is used to carry various information,including address information, data information, acknowledgementinformation, and snoop/invalid information. Although a ring interconnectis illustrated, any known on-die interconnect or fabric may be utilized.As an illustrative example, some of the fabrics discussed above (e.g.another on-die interconnect, On-chip System Fabric (OSF), an AdvancedMicrocontroller Bus Architecture (AMBA) interconnect, amulti-dimensional mesh fabric, or other known interconnect architecture)may be utilized in a similar fashion.

As further depicted, system agent domain 1110 includes display engine 1112 which is to provide control of and an interface to an associated display. System agent domain 1110 may include other units, such as: an integrated memory controller 1120 that provides for an interface to a system memory (e.g., a DRAM implemented with multiple DIMMs) and coherence logic 1122 to perform memory coherence operations. Multiple interfaces may be present to enable interconnection between the processor and other circuitry. For example, in one embodiment at least one direct media interface (DMI) 1116 interface is provided as well as one or more PCIe™ interfaces 1114. The display engine and these interfaces typically couple to memory via a PCIe™ bridge 1118. Still further, to provide for communications between other agents, such as additional processors or other circuitry, one or more other interfaces may be provided.

Referring now to FIG. 12, shown is a block diagram of a representativecore; specifically, logical blocks of a back-end of a core, such as core1130 from FIG. 11. In general, the structure shown in FIG. 12 includesan out-of-order processor that has a front end unit 1270 used to fetchincoming instructions, perform various processing (e.g. caching,decoding, branch predicting, etc.) and passing instructions/operationsalong to an out-of-order (OOO) engine 1280. OOO engine 1280 performsfurther processing on decoded instructions.

Specifically in the embodiment of FIG. 12, out-of-order engine 1280 includes an allocate unit 1282 to receive decoded instructions, which may be in the form of one or more micro-instructions or uops, from front end unit 1270, and allocate them to appropriate resources such as registers and so forth. Next, the instructions are provided to a reservation station 1284, which reserves resources and schedules them for execution on one of a plurality of execution units 1286A-1286N. Various types of execution units may be present, including, for example, arithmetic logic units (ALUs), load and store units, vector processing units (VPUs), and floating point execution units, among others. Results from these different execution units are provided to a reorder buffer (ROB) 1288, which takes unordered results and returns them to correct program order.

Still referring to FIG. 12, note that both front end unit 1270 and out-of-order engine 1280 are coupled to different levels of a memory hierarchy. Specifically shown is an instruction level cache 1272, which in turn couples to a mid-level cache 1276, which in turn couples to a last level cache 1295. In one embodiment, last level cache 1295 is implemented in an on-chip (sometimes referred to as uncore) unit 1290. As an example, unit 1290 is similar to system agent 1110 of FIG. 11. As discussed above, uncore 1290 communicates with system memory 1299, which, in the illustrated embodiment, is implemented via eDRAM. Note also that the various execution units 1286 within out-of-order engine 1280 are in communication with a first level cache 1274 that also is in communication with mid-level cache 1276. Note also that additional cores 1230N-2-1230N can couple to LLC 1295. Although shown at this high level in the embodiment of FIG. 12, understand that various alterations and additional components may be present.

Turning to FIG. 13, a block diagram of an exemplary computer system formed with a processor that includes execution units to execute an instruction, where one or more of the interconnects implement one or more features in accordance with one embodiment of the present invention, is illustrated. System 1300 includes a component, such as a processor 1302, to employ execution units including logic to perform algorithms for processing data, in accordance with the present invention, such as in the embodiment described herein. System 1300 is representative of processing systems based on the PENTIUM III™, PENTIUM 4™, Xeon™, Itanium, XScale™ and/or StrongARM™ microprocessors, although other systems (including PCs having other microprocessors, engineering workstations, set-top boxes and the like) may also be used. In one embodiment, sample system 1300 executes a version of the WINDOWS™ operating system available from Microsoft Corporation of Redmond, Wash., although other operating systems (UNIX and Linux for example), embedded software, and/or graphical user interfaces may also be used. Thus, embodiments of the present invention are not limited to any specific combination of hardware circuitry and software.

Embodiments are not limited to computer systems. Alternative embodimentsof the present invention can be used in other devices such as handhelddevices and embedded applications. Some examples of handheld devicesinclude cellular phones, Internet Protocol devices, digital cameras,personal digital assistants (PDAs), and handheld PCs. Embeddedapplications can include a micro controller, a digital signal processor(DSP), system on a chip, network computers (NetPC), set-top boxes,network hubs, wide area network (WAN) switches, or any other system thatcan perform one or more instructions in accordance with at least oneembodiment.

In this illustrated embodiment, processor 1302 includes one or moreexecution units 1308 to implement an algorithm that is to perform atleast one instruction. One embodiment may be described in the context ofa single processor desktop or server system, but alternative embodimentsmay be included in a multiprocessor system. System 1300 is an example ofa ‘hub’ system architecture. The computer system 1300 includes aprocessor 1302 to process data signals. The processor 1302, as oneillustrative example, includes a complex instruction set computer (CISC)microprocessor, a reduced instruction set computing (RISC)microprocessor, a very long instruction word (VLIW) microprocessor, aprocessor implementing a combination of instruction sets, or any otherprocessor device, such as a digital signal processor, for example. Theprocessor 1302 is coupled to a processor bus 1310 that transmits datasignals between the processor 1302 and other components in the system1300. The elements of system 1300 (e.g. graphics accelerator 1312,memory controller hub 1316, memory 1320, I/O controller hub 1324,wireless transceiver 1326, Flash BIOS 1328, Network controller 1334,Audio controller 1336, Serial expansion port 1338, I/O controller 1340,etc.) perform their conventional functions that are well known to thosefamiliar with the art.

In one embodiment, the processor 1302 includes a Level 1 (L1) internalcache memory 1304. Depending on the architecture, the processor 1302 mayhave a single internal cache or multiple levels of internal caches.Other embodiments include a combination of both internal and externalcaches depending on the particular implementation and needs. Registerfile 1306 is to store different types of data in various registersincluding integer registers, floating point registers, vector registers,banked registers, shadow registers, checkpoint registers, statusregisters, and instruction pointer register.

Execution unit 1308, including logic to perform integer and floatingpoint operations, also resides in the processor 1302. The processor1302, in one embodiment, includes a microcode (ucode) ROM to storemicrocode, which when executed, is to perform algorithms for certainmacroinstructions or handle complex scenarios. Here, microcode ispotentially updateable to handle logic bugs/fixes for processor 1302.For one embodiment, execution unit 1308 includes logic to handle apacked instruction set 1309. By including the packed instruction set1309 in the instruction set of a general-purpose processor 1302, alongwith associated circuitry to execute the instructions, the operationsused by many multimedia applications may be performed using packed datain a general-purpose processor 1302. Thus, many multimedia applicationsare accelerated and executed more efficiently by using the full width ofa processor's data bus for performing operations on packed data. Thispotentially eliminates the need to transfer smaller units of data acrossthe processor's data bus to perform one or more operations, one dataelement at a time.

Alternate embodiments of an execution unit 1308 may also be used inmicro controllers, embedded processors, graphics devices, DSPs, andother types of logic circuits. System 1300 includes a memory 1320.Memory 1320 includes a dynamic random access memory (DRAM) device, astatic random access memory (SRAM) device, flash memory device, or othermemory device. Memory 1320 stores instructions and/or data representedby data signals that are to be executed by the processor 1302.

Note that any of the aforementioned features or aspects of the inventionmay be utilized on one or more interconnect illustrated in FIG. 13. Forexample, an on-die interconnect (ODI), which is not shown, for couplinginternal units of processor 1302 implements one or more aspects of theinvention described above. Or the invention is associated with aprocessor bus 1310 (e.g. other known high performance computinginterconnect), a high bandwidth memory path 1318 to memory 1320, apoint-to-point link to graphics accelerator 1312 (e.g. a PeripheralComponent Interconnect express (PCIe) compliant fabric), a controllerhub interconnect 1322, an I/O or other interconnect (e.g. USB, PCI,PCIe) for coupling the other illustrated components. Some examples ofsuch components include the audio controller 1336, firmware hub (flashBIOS) 1328, wireless transceiver 1326, data storage 1324, legacy I/Ocontroller 1310 containing user input and keyboard interfaces 1342, aserial expansion port 1338 such as Universal Serial Bus (USB), and anetwork controller 1334. The data storage device 1324 can comprise ahard disk drive, a floppy disk drive, a CD-ROM device, a flash memorydevice, or other mass storage device.

Referring now to FIG. 14, shown is a block diagram of a second system1400 in accordance with an embodiment of the present invention. As shownin FIG. 14, multiprocessor system 1400 is a point-to-point interconnectsystem, and includes a first processor 1470 and a second processor 1480coupled via a point-to-point interconnect 1450. Each of processors 1470and 1480 may be some version of a processor. In one embodiment, 1452 and1454 are part of a serial, point-to-point coherent interconnect fabric,such as a high-performance architecture. As a result, the invention maybe implemented within the QPI architecture.

While shown with only two processors 1470, 1480, it is to be understoodthat the scope of the present invention is not so limited. In otherembodiments, one or more additional processors may be present in a givenprocessor.

Processors 1470 and 1480 are shown including integrated memorycontroller units 1472 and 1482, respectively. Processor 1470 alsoincludes as part of its bus controller units point-to-point (P-P)interfaces 1476 and 1478; similarly, second processor 1480 includes P-Pinterfaces 1486 and 1488. Processors 1470, 1480 may exchange informationvia a point-to-point (P-P) interface 1450 using P-P interface circuits1478, 1488. As shown in FIG. 14, IMCs 1472 and 1482 couple theprocessors to respective memories, namely a memory 1432 and a memory1434, which may be portions of main memory locally attached to therespective processors.

Processors 1470, 1480 each exchange information with a chipset 1490 viaindividual P-P interfaces 1452, 1454 using point to point interfacecircuits 1476, 1494, 1486, 1498. Chipset 1490 also exchanges informationwith a high-performance graphics circuit 1438 via an interface circuit1492 along a high-performance graphics interconnect 1439.

A shared cache (not shown) may be included in either processor oroutside of both processors; yet connected with the processors via P-Pinterconnect, such that either or both processors' local cacheinformation may be stored in the shared cache if a processor is placedinto a low power mode.

Chipset 1490 may be coupled to a first bus 1416 via an interface 1496.In one embodiment, first bus 1416 may be a Peripheral ComponentInterconnect (PCI) bus, or a bus such as a PCI Express bus or anotherthird generation I/O interconnect bus, although the scope of the presentinvention is not so limited.

As shown in FIG. 14, various I/O devices 1414 are coupled to first bus1416, along with a bus bridge 1418 which couples first bus 1416 to asecond bus 1420. In one embodiment, second bus 1420 includes a low pincount (LPC) bus. Various devices are coupled to second bus 1420including, for example, a keyboard and/or mouse 1422, communicationdevices 1427 and a storage unit 1428 such as a disk drive or other massstorage device which often includes instructions/code and data 1430, inone embodiment. Further, an audio I/O 1424 is shown coupled to secondbus 1420. Note that other architectures are possible, where the includedcomponents and interconnect architectures vary. For example, instead ofthe point-to-point architecture of FIG. 14, a system may implement amulti-drop bus or other such architecture.

Turning next to FIG. 15, an embodiment of a system on-chip (SOC) design in accordance with the invention is depicted. As a specific illustrative example, SOC 1500 is included in user equipment (UE). In one embodiment, UE refers to any device to be used by an end-user to communicate, such as a hand-held phone, smartphone, tablet, ultra-thin notebook, notebook with broadband adapter, or any other similar communication device. Often a UE connects to a base station or node, which potentially corresponds in nature to a mobile station (MS) in a GSM network.

Here, SOC 1500 includes 2 cores—1506 and 1507. Similar to the discussion above, cores 1506 and 1507 may conform to an Instruction Set Architecture, such as an Intel® Architecture Core™-based processor, an Advanced Micro Devices, Inc. (AMD) processor, a MIPS-based processor, an ARM-based processor design, or a customer thereof, as well as their licensees or adopters. Cores 1506 and 1507 are coupled to cache control 1508 that is associated with bus interface unit 1509 and L2 cache 1511 to communicate with other parts of system 1500. Interconnect 1510 includes an on-chip interconnect, such as an IOSF, AMBA, or other interconnect discussed above, which potentially implements one or more aspects described herein.

Interface 1510 provides communication channels to the other components,such as a Subscriber Identity Module (SIM) 1530 to interface with a SIMcard, a boot rom 1535 to hold boot code for execution by cores 1506 and1507 to initialize and boot SOC 1500, a SDRAM controller 1540 tointerface with external memory (e.g. DRAM 1560), a flash controller 1545to interface with non-volatile memory (e.g. Flash 1565), a peripheralcontrol 1550 (e.g. Serial Peripheral Interface) to interface withperipherals, video codecs 1520 and Video interface 1525 to display andreceive input (e.g. touch enabled input), GPU 1515 to perform graphicsrelated computations, etc. Any of these interfaces may incorporateaspects of the invention described herein.

In addition, the system illustrates peripherals for communication, such as a Bluetooth module 1570, 3G modem 1575, GPS 1585, and WiFi 1585. Note, as stated above, a UE includes a radio for communication. As a result, these peripheral communication modules are not all required. However, in a UE some form of radio for external communication is to be included.

While the present invention has been described with respect to a limitednumber of embodiments, those skilled in the art will appreciate numerousmodifications and variations therefrom. It is intended that the appendedclaims cover all such modifications and variations as fall within thetrue spirit and scope of this present invention.

A design may go through various stages, from creation to simulation tofabrication. Data representing a design may represent the design in anumber of manners. First, as is useful in simulations, the hardware maybe represented using a hardware description language or anotherfunctional description language. Additionally, a circuit level modelwith logic and/or transistor gates may be produced at some stages of thedesign process. Furthermore, most designs, at some stage, reach a levelof data representing the physical placement of various devices in thehardware model. In the case where conventional semiconductor fabricationtechniques are used, the data representing the hardware model may be thedata specifying the presence or absence of various features on differentmask layers for masks used to produce the integrated circuit. In anyrepresentation of the design, the data may be stored in any form of amachine readable medium. A memory or a magnetic or optical storage suchas a disc may be the machine readable medium to store informationtransmitted via optical or electrical wave modulated or otherwisegenerated to transmit such information. When an electrical carrier waveindicating or carrying the code or design is transmitted, to the extentthat copying, buffering, or re-transmission of the electrical signal isperformed, a new copy is made. Thus, a communication provider or anetwork provider may store on a tangible, machine-readable medium, atleast temporarily, an article, such as information encoded into acarrier wave, embodying techniques of embodiments of the presentinvention.

A module as used herein refers to any combination of hardware, software,and/or firmware. As an example, a module includes hardware, such as amicro-controller, associated with a non-transitory medium to store codeadapted to be executed by the micro-controller. Therefore, reference toa module, in one embodiment, refers to the hardware, which isspecifically configured to recognize and/or execute the code to be heldon a non-transitory medium. Furthermore, in another embodiment, use of amodule refers to the non-transitory medium including the code, which isspecifically adapted to be executed by the microcontroller to performpredetermined operations. And as can be inferred, in yet anotherembodiment, the term module (in this example) may refer to thecombination of the microcontroller and the non-transitory medium. Oftenmodule boundaries that are illustrated as separate commonly vary andpotentially overlap. For example, a first and a second module may sharehardware, software, firmware, or a combination thereof, whilepotentially retaining some independent hardware, software, or firmware.In one embodiment, use of the term logic includes hardware, such astransistors, registers, or other hardware, such as programmable logicdevices.

Use of the phrase ‘configured to,’ in one embodiment, refers to arranging, putting together, manufacturing, offering to sell, importing and/or designing an apparatus, hardware, logic, or element to perform a designated or determined task. In this example, an apparatus or element thereof that is not operating is still ‘configured to’ perform a designated task if it is designed, coupled, and/or interconnected to perform said designated task. As a purely illustrative example, a logic gate may provide a 0 or a 1 during operation. But a logic gate ‘configured to’ provide an enable signal to a clock does not include every potential logic gate that may provide a 1 or 0. Instead, the logic gate is one coupled in some manner that during operation the 1 or 0 output is to enable the clock. Note once again that use of the term ‘configured to’ does not require operation, but instead focuses on the latent state of an apparatus, hardware, and/or element, where in the latent state the apparatus, hardware, and/or element is designed to perform a particular task when the apparatus, hardware, and/or element is operating.

Furthermore, use of the phrases ‘to,’ ‘capable of/to,’ and/or ‘operable to,’ in one embodiment, refers to some apparatus, logic, hardware, and/or element designed in such a way to enable use of the apparatus, logic, hardware, and/or element in a specified manner. Note as above that use of to, capable to, or operable to, in one embodiment, refers to the latent state of an apparatus, logic, hardware, and/or element, where the apparatus, logic, hardware, and/or element is not operating but is designed in such a manner to enable use of an apparatus in a specified manner.

A value, as used herein, includes any known representation of a number,a state, a logical state, or a binary logical state. Often, the use oflogic levels, logic values, or logical values is also referred to as 1'sand 0's, which simply represents binary logic states. For example, a 1refers to a high logic level and 0 refers to a low logic level. In oneembodiment, a storage cell, such as a transistor or flash cell, may becapable of holding a single logical value or multiple logical values.However, other representations of values in computer systems have beenused. For example the decimal number ten may also be represented as abinary value of 1010 and a hexadecimal letter A. Therefore, a valueincludes any representation of information capable of being held in acomputer system.

Moreover, states may be represented by values or portions of values. Asan example, a first value, such as a logical one, may represent adefault or initial state, while a second value, such as a logical zero,may represent a non-default state. In addition, the terms reset and set,in one embodiment, refer to a default and an updated value or state,respectively. For example, a default value potentially includes a highlogical value, i.e. reset, while an updated value potentially includes alow logical value, i.e. set. Note that any combination of values may beutilized to represent any number of states.

The embodiments of methods, hardware, software, firmware or code set forth above may be implemented via instructions or code stored on a machine-accessible, machine readable, computer accessible, or computer readable medium which are executable by a processing element. A non-transitory machine-accessible/readable medium includes any mechanism that provides (i.e., stores and/or transmits) information in a form readable by a machine, such as a computer or electronic system. For example, a non-transitory machine-accessible medium includes random-access memory (RAM), such as static RAM (SRAM) or dynamic RAM (DRAM); ROM; magnetic or optical storage medium; flash memory devices; electrical storage devices; optical storage devices; acoustical storage devices; other forms of storage devices for holding information received from transitory (propagated) signals (e.g., carrier waves, infrared signals, digital signals); etc., which are to be distinguished from the non-transitory mediums that may receive information therefrom.

Instructions used to program logic to perform embodiments of the invention may be stored within a memory in the system, such as DRAM, cache, flash memory, or other storage. Furthermore, the instructions can be distributed via a network or by way of other computer readable media. Thus a machine-readable medium may include any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer), including, but not limited to, floppy diskettes, optical disks, Compact Disc Read-Only Memory (CD-ROMs), magneto-optical disks, Read-Only Memory (ROMs), Random Access Memory (RAM), Erasable Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), magnetic or optical cards, flash memory, or a tangible, machine-readable storage used in the transmission of information over the Internet via electrical, optical, acoustical or other forms of propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.). Accordingly, the computer-readable medium includes any type of tangible machine-readable medium suitable for storing or transmitting electronic instructions or information in a form readable by a machine (e.g., a computer).

The following examples pertain to embodiments in accordance with thisSpecification. One or more embodiments may provide an apparatus, asystem, a machine readable storage, a machine readable medium, hardware-and/or software-based logic, and a method to provide a shared memorycontroller to service load and store operations from a plurality ofindependent nodes to provide access to a shared memory resource, whereineach of the plurality of independent nodes is to be permitted to accessa respective portion of the shared memory resource.

In at least one example, the load and store operations are communicatedusing a shared memory link protocol.

In at least one example, the shared memory link protocol includes a memory access protocol utilizing physical layer logic of a different interconnect protocol.

In at least one example, the shared memory link protocol provides for multiplexing between transmission of data of the memory access protocol and transmission of data of the interconnect protocol.

In at least one example, the data of the interconnect protocol comprisesat least one of link layer data and transaction layer data.

In at least one example, the memory access protocol comprises SMI3 andthe interconnect protocol comprises Peripheral Component Interconnect(PCI) Express (PCIe).

In at least one example, transitions between interconnect protocol dataand memory access protocol data are identified by a sync header encodedto identify the transitions.

In at least one example, transitions between interconnect protocol dataand memory access protocol data are identified by a start of dataframing token encoded to identify the transitions.

In at least one example, transitions from interconnect protocol data tomemory access protocol data are identified by an end of data streamframing token of the interconnect protocol encoded to identify thetransitions, and transitions from memory access protocol data tointerconnect protocol data are identified by link layer control flits ofthe memory access protocol.

In at least one example, the shared memory link protocol is tunneledover a network protocol stack.

In at least one example, the network protocol stack comprises Ethernet.

In at least one example, a first of the plurality of CPU nodes is on afirst board and a second of the plurality of CPU nodes is on a secondboard separate from the first board.

In at least one example, at least two of the plurality of CPU nodes areon the same device.

In at least one example, the shared memory controller is further totrack memory transactions involving the load and store operations.

In at least one example, the shared memory controller is further to identify that a particular one of the plurality of CPU nodes fails, identify a portion of the memory transactions of the particular CPU node, and drop the portion of the memory transactions of the particular CPU node while maintaining all other memory transactions.
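
As a rough sketch of that behavior, a shared memory controller might keep a table of in-flight transactions tagged by node, so that a detected node failure drops only that node's entries; the table layout below is hypothetical.

    #include <stdint.h>

    #define MAX_TXN 256

    struct txn {
        uint16_t node_id;    /* node that issued the load/store   */
        int      valid;      /* 1 = still in flight               */
        uint64_t addr;       /* address targeted by the operation */
    };

    static struct txn txn_table[MAX_TXN];

    /* Drop the failed node's in-flight transactions; every other node's
     * transactions are left untouched. */
    void drop_transactions_of(uint16_t failed_node)
    {
        for (int i = 0; i < MAX_TXN; i++)
            if (txn_table[i].valid && txn_table[i].node_id == failed_node)
                txn_table[i].valid = 0;
    }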

In at least one example, the shared memory controller is further tomanage access permissions by the plurality of CPU nodes to data in theshared memory resource.

In at least one example, at least a particular one of the plurality ofCPU nodes is blocked from accessing at least a first portion of theshared memory and a second one of the plurality of CPU nodes ispermitted to access the first portion.

In at least one example, the shared memory controller is further tomanage directory information for data in the shared memory resource.

In at least one example, the directory information identifies, for each of a plurality of data resources stored in the shared memory resource, whether access to the respective data resource is exclusive to one of the plurality of CPU nodes or shared between two or more of the plurality of CPU nodes.

In at least one example, the shared memory controller is further to negotiate a change of access for a particular one of the plurality of data resources, wherein the change comprises at least one of changing access from shared to exclusive and changing access from exclusive to shared.
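
A directory of this kind can be sketched as below, with one entry per data resource recording whether the resource is uncached, shared, or exclusively held, and a helper that carries out the shared-to-exclusive change by invalidating the other sharers; the structures and the invalidate callback are assumptions for illustration.

    #include <stdint.h>

    enum dir_state { DIR_UNCACHED, DIR_SHARED, DIR_EXCLUSIVE };

    struct dir_entry {
        enum dir_state state;
        uint64_t       sharers;   /* bitmap, one bit per CPU node */
    };

    /* Hypothetical hook asking a node to give up its copy of a data resource. */
    void invalidate_copy(uint16_t node, uint64_t resource);

    /* Change access for one data resource from shared to exclusive on behalf
     * of the requesting node. */
    void make_exclusive(struct dir_entry *e, uint16_t node, uint64_t resource)
    {
        for (uint16_t n = 0; n < 64; n++)
            if (n != node && (e->sharers & (1ull << n)))
                invalidate_copy(n, resource);
        e->sharers = 1ull << node;
        e->state   = DIR_EXCLUSIVE;
    }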

In at least one example, the shared memory controller is coupled to at least one other shared memory controller managing at least one other shared memory resource and the shared memory controller is further to communicate load/store operations to the other shared memory controller to permit the plurality of CPU nodes to access the other shared memory.

In at least one example, the shared memory controller is further to map address information in the load and store operations to corresponding data resources stored in the shared memory resource.
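
The mapping can be sketched as a per-node translation table: the address carried in a node's load or store is looked up in that node's table and converted to an offset within the shared memory resource, and entries belonging to different nodes may resolve to the same offset. The table format is invented for the example.

    #include <stdint.h>

    struct map_entry {
        uint64_t node_base;     /* base of the range as the node addresses it */
        uint64_t length;
        uint64_t pool_offset;   /* where the range lives in shared memory     */
    };

    /* Translate a node-visible address to an offset in the shared memory
     * resource; returns -1 if the address is not mapped for this node. */
    int map_address(const struct map_entry *table, int entries,
                    uint64_t addr, uint64_t *pool_offset)
    {
        for (int i = 0; i < entries; i++) {
            if (addr >= table[i].node_base &&
                addr <  table[i].node_base + table[i].length) {
                *pool_offset = table[i].pool_offset + (addr - table[i].node_base);
                return 0;
            }
        }
        return -1;
    }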

One or more embodiments may provide an apparatus, a system, a machinereadable storage, a machine readable medium, hardware- and/orsoftware-based logic, and a method to send a memory access request to ashared memory controller, wherein the memory access request comprises aload/store operation and is to identify an address of a data resource tobe included in a shared memory resource corresponding to the sharedmemory controller, and each of a plurality of independent nodes is to bepermitted to access a respective portion of the shared memory resource.

In at least one example, the memory access request comprises a loadrequest and the I/O logic is further to receive data corresponding tothe data resource in response to the load request.

In at least one example, the memory access request comprises a storerequest.

In at least one example, the memory access request is sent using a shared memory link protocol and the shared memory link protocol includes a memory access protocol utilizing physical layer logic of a different interconnect protocol.

In at least one example, the shared memory link protocol provides for multiplexing between transmission of data of the memory access protocol and transmission of data of the interconnect protocol.

In at least one example, transitions between interconnect protocol dataand memory access protocol data are identified by at least one of: (a) async header encoded to identify the transitions; (b) a start of dataframing token encoded to identify the transitions; and (c) an end ofdata stream framing token encoded to identify the transitions.

In at least one example, the memory access protocol comprises SMI3 andthe interconnect protocol comprises a PCIe-based protocol.

In at least one example, a particular one of the plurality of nodescomprises multiple CPU sockets and local memory. In at least oneexample, the shared memory resource is on a device separate from theparticular node.

One or more embodiments may provide an apparatus, a system, a machine readable storage, a machine readable medium, hardware- and/or software-based logic, and a method to receive a first load/store message from a first independent CPU node that identifies particular data in a shared memory, provide access to the particular data to the first CPU node in response to the first load/store message, receive a second load/store message from a second independent CPU node that identifies the particular data in the shared memory, and provide access to the particular data to the second CPU node in response to the second load/store message.

In at least one example, each of the first and second load/store messages is received over a data link using a shared memory link protocol.

At least some embodiments can provide for identifying that the first CPUnode is permitted to access the particular data and identifying that thesecond CPU node is permitted to access the particular data.

At least some embodiments can provide for tracking transactionsinvolving the shared memory for each of the first and second CPU nodes.

At least some embodiments can provide for identifying directoryinformation of the particular data, where the directory informationidentifies whether the particular data is in a shared, uncached, orexclusive state.

In at least one example, the first load/store message identifies the particular data by a first address and the second load/store message identifies the particular data by a second, different address.

At least some embodiments can provide for mapping the first address tothe particular data and mapping the second address to the particulardata.

At least some embodiments can provide for a system comprising a firstnode comprising one or more processor devices, a second node independentfrom the first node and comprising one or more processor devices, and ashared memory accessible to each of the first and second nodes through aload/store memory access protocol.

In at least one example, the first node has a fault domain independentof the second node.

In at least one example, the first node is controlled by a firstoperating system and the second node is controlled by a second operatingsystem.

In at least one example, the load/store memory access protocol is included in a shared memory link protocol and the shared memory link protocol provides for toggling between the memory access protocol and a different interconnect protocol.

In at least one example, a shared memory controller can service load and store operations from the first and second nodes and provide access to the shared memory.

One or more embodiments may provide an apparatus, a system, a machinereadable storage, a machine readable medium, hardware- and/orsoftware-based logic, and a method to send a first sync header on lanesof a data link, wherein the first sync header is encoded to identify atransition from data of an interconnect protocol to data of a memoryaccess protocol, and send a second sync header on the lanes of the datalink, wherein the second sync header is to be encoded to identify atransition from data of the memory access protocol to data of theinterconnect protocol.

In at least one example, each sync header identifies a type of a datablock to follow the sync header.

In at least one example, each data block is of a predefined length.

In at least one example, the memory access protocol comprises a protocolbased on SMI3.

In at least one example, the interconnect protocol comprises a PCIe-based protocol.

In at least one example, each sync header is encoded according to 128b/130b encoding.

In at least one example, the second sync header indicates a data blockof the interconnect protocol and a third sync header is to be sent onthe lanes of the data link to indicate an ordered set block of theinterconnect protocol.

In at least one example, the first sync header is encoded with alternating values on the lanes and the second sync header is encoded with a same value on all of the lanes.
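
A transmit-side counterpart of the earlier receive sketch might fill the per-lane sync bits as follows, alternating two values across the lanes ahead of memory access protocol data and repeating one value on every lane ahead of interconnect protocol data; the specific bit values again follow the assumed PCIe 128b/130b convention.

    #include <stdint.h>

    #define SYNC_DATA 0x2u   /* 10b */
    #define SYNC_OS   0x1u   /* 01b */

    /* Fill the sync header driven on each lane for the next block. */
    void fill_sync(uint8_t *sync, int nlanes, int mem_access_block)
    {
        for (int i = 0; i < nlanes; i++) {
            if (mem_access_block)
                sync[i] = (i % 2) ? SYNC_DATA : SYNC_OS;  /* alternating values */
            else
                sync[i] = SYNC_DATA;                      /* same value on all  */
        }
    }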

In at least one example, the data of the memory access protocolcomprises link layer data and the data of the interconnect protocolcomprises one of transaction layer and data link layer packets.

In at least one example, the sync headers are defined according to theinterconnect protocol.

In at least one example, the memory access protocol supports load/storememory access messaging.

In at least one example, the memory access protocol data comprisesmemory access messaging for access to a shared memory resource, whereineach of a plurality of independent nodes is permitted to access arespective portion of the shared memory resource.

In at least one example, each of the plurality of independent nodes hasan independent fault domain.

In at least one example, the data link comprises at least four lanes.

One or more embodiments may provide an apparatus, a system, a machinereadable storage, a machine readable medium, hardware- and/orsoftware-based logic, and a method to receive a first sync header onlanes of a data link, wherein the first sync header is encoded with afirst encoding, identify, from the first encoding of the first syncheader, a transition from data of an interconnect protocol to data of amemory access protocol, receive a second sync header on the lanes of thedata link, wherein the second sync header is encoded with a secondencoding, and identify, from the second encoding of the second syncheader, a transition from data of the memory access protocol to data ofthe interconnect protocol.

In at least one example, each sync header identifies a type of a datablock to follow the sync header.

In at least one example, the interconnect protocol comprises aPCIe-based protocol.

In at least one example, the memory access protocol is based on SMI3.

In at least one example, the sync header is encoded according to 128b/130b encoding.

In at least one example, the first encoding comprises values of 01b and 10b alternated on the lanes of the data link.

In at least one example, the data of the memory access protocolcomprises load/store memory access messages.

In at least one example, the memory access messages comprise messages toaccess a shared memory resource, and each of a plurality of independentnodes in a system are permitted to access a respective portion of theshared memory resource.

One or more embodiments may provide an apparatus, a system, a machinereadable storage, a machine readable medium, hardware- and/orsoftware-based logic, and a method to receive a first sync header onlanes of a data link, wherein the first sync header is encoded with afirst encoding, identify from the first encoding of the first syncheader a transition from data of an interconnect protocol to data of amemory access protocol, process the data of the memory access protocol,receive a second sync header on the lanes of the data link, wherein thesecond sync header is encoded with a second encoding, and identify, fromthe second encoding of the second sync header, a transition from data ofthe memory access protocol to data of the interconnect protocol.

In at least one example, the interconnect protocol comprises aPCIe-based protocol and the memory access protocol is based on SMI3.

In at least one example, the sync headers are defined according to PCIe.

In at least one example, the data of the memory access protocol is processed to service a memory access request included in the data of the memory access protocol.

In at least one example, the memory access request is a request of a shared memory resource shared between a plurality of independent CPU nodes.

In at least one example, the memory access request comprises a load/store message.

One or more embodiments may provide an apparatus, a system, a machine readable storage, a machine readable medium, hardware- and/or software-based logic, and a method to send a first start of data framing token on lanes of a data link, wherein the first start of data framing token is encoded to identify a transition from data of an interconnect protocol to data of a memory access protocol, and send a second start of data framing token on the lanes of the data link, wherein the second start of data framing token is encoded to identify a transition from data of the memory access protocol to data of the interconnect protocol.

In at least one example, the first start of data framing token comprises a modified PCIe STP framing token and the second start of data framing token comprises a PCIe STP framing token.

In at least one example, each start of data framing token includes a length field.

In at least one example, the transition from data of the interconnect protocol to data of the memory access protocol is indicated in the first start of data framing token by a value in the length field of the first start of data framing token.

In at least one example, the data of the memory access protocol is to be sent in a window defined by the length field of the first start of data framing token.
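A minimal sketch of how a receiver might use such a framing token follows, assuming a token structure invented here for illustration: a flag on the token stands in for the modified encoding and marks whether the window announced by the length field carries memory access protocol data or ordinary interconnect protocol data, and the payload is routed accordingly. The struct, field names, and consume_window function are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical STP-style framing token: the length field (in DWs)
 * defines the window that follows, and a flag, standing in for the
 * modified encoding described above, marks a memory access window. */
struct stp_token {
    uint16_t length_dw;     /* size of the payload window, in DWs        */
    bool     mem_access;    /* true: window carries memory access data   */
};

/* Route the window that follows the token to the proper link layer. */
static void consume_window(const struct stp_token *t, const uint32_t *dw)
{
    for (uint16_t i = 0; i < t->length_dw; i++) {
        if (t->mem_access)
            printf("memory access link layer consumes DW 0x%08x\n", dw[i]);
        else
            printf("PCIe link layer consumes DW 0x%08x\n", dw[i]);
    }
}

int main(void)
{
    uint32_t payload[4] = { 0x11111111, 0x22222222, 0x33333333, 0x44444444 };
    struct stp_token pcie = { .length_dw = 4, .mem_access = false };
    struct stp_token smi3 = { .length_dw = 4, .mem_access = true  };
    consume_window(&pcie, payload);  /* interconnect protocol window  */
    consume_window(&smi3, payload);  /* memory access protocol window */
    return 0;
}
```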

In at least one example, the memory access protocol is based on SMI3.

In at least one example, the interconnect protocol comprises a PCIe-based protocol.

In at least one example, the data of the memory access protocol comprises link layer data and the data of the interconnect protocol comprises one of transaction layer and data link layer packets.

In at least one example, physical layer logic is further to send the data of the memory access protocol and the data of the memory access protocol comprises load/store memory access messages.

In at least one example, the memory access protocol data comprises memory access messages to access a shared memory resource, and each of a plurality of independent nodes is permitted to access a respective portion of the shared memory resource.

In at least one example, each of the plurality of independent nodes has an independent fault domain.

In at least one example, the data link comprises one or more lanes.

One or more embodiments may provide an apparatus, a system, a machine readable storage, a machine readable medium, hardware- and/or software-based logic, and a method to receive a first start of data framing token on lanes of a data link, identify, from the first start of data framing token, arrival of data of a memory access protocol, receive a second start of data framing token on lanes of the data link, wherein the second start of data framing token is different from the first start of data framing token, and identify, from the second start of data framing token, arrival of data of an interconnect protocol.

In at least one example, the first start of data framing token comprises a modified PCIe STP framing token and the second start of data framing token comprises a PCIe STP framing token.

In at least one example, each start of data framing token includes a length field.

In at least one example, the transition from data of the interconnect protocol to data of the memory access protocol is indicated in the first start of data framing token by a value in the length field of the first start of data framing token.

In at least one example, the memory access protocol is based on SMI3 and the interconnect protocol comprises a PCIe-based protocol.

In at least one example, the data of the memory access protocol is received and the data of the interconnect protocol is received.

One or more embodiments may provide an apparatus, a system, a machine readable storage, a machine readable medium, hardware- and/or software-based logic, and a method to send a first end of data stream framing token on lanes of a data link, wherein the first end of data stream framing token is encoded to identify a transition from an interconnect protocol to a memory access protocol, send memory access protocol data following the transition to the memory access protocol, and send link layer control data of the memory access protocol to identify a transition from the memory access protocol to the interconnect protocol.

In at least one example, the memory access protocol data is to be sent on the data link until the link layer control data is sent.

In at least one example, the transition to the memory access protocol causes a transition from interconnect protocol logic handling data on the data link to memory access protocol logic handling data on the data link.
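One way to picture the hand-off described above is as a small state machine that tracks which link layer currently owns the link: a modified EDS framing token hands ownership to the memory access protocol, and a link layer control message of the memory access protocol hands it back. The event names and the step function below are assumptions made for illustration, not the specification's implementation.

```c
#include <stdio.h>

/* Which link layer currently owns the physical link. */
enum owner { OWNER_PCIE, OWNER_SMI3 };

/* Events inferred from the description above: a modified EDS framing
 * token hands the link to the memory access protocol, and an SMI3-style
 * link layer control flit hands it back to the interconnect protocol. */
enum event { EV_MODIFIED_EDS, EV_SMI3_LINK_CTRL, EV_PAYLOAD };

static enum owner step(enum owner cur, enum event ev)
{
    if (cur == OWNER_PCIE && ev == EV_MODIFIED_EDS)
        return OWNER_SMI3;   /* interconnect -> memory access protocol */
    if (cur == OWNER_SMI3 && ev == EV_SMI3_LINK_CTRL)
        return OWNER_PCIE;   /* memory access -> interconnect protocol */
    return cur;              /* payload stays with the current owner   */
}

int main(void)
{
    enum owner o = OWNER_PCIE;
    enum event trace[] = { EV_PAYLOAD, EV_MODIFIED_EDS, EV_PAYLOAD,
                           EV_SMI3_LINK_CTRL, EV_PAYLOAD };
    for (unsigned i = 0; i < sizeof trace / sizeof trace[0]; i++) {
        o = step(o, trace[i]);
        printf("event %u -> %s\n", i, o == OWNER_PCIE ? "PCIe" : "SMI3");
    }
    return 0;
}
```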

In at least one example, the memory access protocol comprises a protocol based on SMI3.

In at least one example, the interconnect protocol comprises a PCIe-based protocol.

In at least one example, the first end of data stream framing token comprises a modified PCIe EDS framing token.

In at least one example, a PCIe EDS is sent to indicate an end of a set of PCIe transaction layer packets and an arrival of a PCIe ordered set block.

In at least one example, the data of the memory access protocol comprises link layer data and the data of the interconnect protocol comprises one of transaction layer and data link layer packets.

In at least one example, the data of the memory access protocol is sent and comprises load/store memory access messages.

In at least one example, the memory access protocol data comprises memory access messages to access a shared memory resource, and each of a plurality of independent nodes is permitted to access a respective portion of the shared memory resource.

In at least one example, each of the plurality of independent nodes has an independent fault domain.

One or more embodiments may provide an apparatus, a system, a machine readable storage, a machine readable medium, hardware- and/or software-based logic, and a method to receive a first end of data stream framing token on lanes of a data link that is encoded to identify a transition from an interconnect protocol to a memory access protocol, transition to using link layer logic of the memory access protocol based on the first end of data stream framing token, receive memory access protocol link layer data, receive link layer control data of the memory access protocol to identify a transition from the memory access protocol to the interconnect protocol, and transition to using link layer logic of the interconnect protocol based on the link layer control data.

In at least one example, the memory access protocol is based on SMI3.

In at least one example, the interconnect protocol comprises a PCIe-based protocol.

In at least one example, the first end of data stream framing token comprises a modified PCIe EDS framing token.

In at least one example, the data of the memory access protocol comprises link layer data and the data of the interconnect protocol comprises one of transaction layer and data link layer packets.

In at least one example, the data of the memory access protocol comprises load/store memory access messages.

Reference throughout this specification to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, the appearances of the phrases “in one embodiment” or “in an embodiment” in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.

In the foregoing specification, a detailed description has been given with reference to specific exemplary embodiments. It will, however, be evident that various modifications and changes may be made thereto without departing from the broader spirit and scope of the invention as set forth in the appended claims. The specification and drawings are, accordingly, to be regarded in an illustrative sense rather than a restrictive sense. Furthermore, the foregoing use of embodiment and other exemplary language does not necessarily refer to the same embodiment or the same example, but may refer to different and distinct embodiments, as well as potentially the same embodiment.

What is claimed is:
 1. An apparatus comprising: a board comprising Peripheral Component Interconnect Express (PCIe) electricals, wherein the PCIe electricals are to implement a PCIe physical layer; first protocol stack circuitry to implement at least a portion of a PCIe stack, wherein the first protocol stack circuitry is to generate PCIe-based packets and cause the PCIe-based packets to be sent over the PCIe physical layer; and second protocol stack circuitry to implement at least a portion of another protocol stack, wherein the other protocol stack is to implement a memory interconnect, and the second protocol stack circuitry is to generate flits of a different protocol and cause the flits to be sent over the PCIe physical layer.
 2. The apparatus of claim 1, wherein the different protocol comprises a non-PCIe protocol.
 3. The apparatus of claim 1, further comprising one or more cores.
 4. The apparatus of claim 1, wherein the second protocol stack circuitry implements two or more of a transaction layer, a protocol layer, and a link layer.
 5. The apparatus of claim 1, wherein the other protocol comprises a memory access protocol.
 6. The apparatus of claim 1, wherein the other protocol comprises a cache coherent protocol.
 7. The apparatus of claim 1, wherein the other protocol realizes a lower latency than PCIe.
 8. The apparatus of claim 1, further comprising a controller to multiplex the PCIe-based packets and the flits of the different protocol.
 9. The apparatus of claim 1, further comprising a link to connect to another device, wherein the link is implemented by PCIe electricals.
 10. A method comprising: generating Peripheral Component Interconnect Express (PCIe)-based packets in a first mode; sending the PCIe-based packets over PCIe electricals of a board during the first mode, wherein the PCIe electricals implement a PCIe physical layer of a link; generating flits of a different other protocol in a second mode, wherein the other protocol comprises a memory interconnect protocol; and sending the flits over the PCIe electricals of the board during the second mode.
 11. The method of claim 10, further comprising multiplexing sending of the packets and sending of the flits.
 12. The method of claim 10, wherein data sent in the flits has a lower latency than data sent in the packets.
 13. The method of claim 10, wherein the packets and flits are sent from a first device on a board to a second device on the board.
 14. The method of claim 10, wherein the memory interconnect protocol comprises a cache coherent protocol.
 15. A system comprising: an interconnect comprising Peripheral Component Interconnect Express (PCIe) electricals to implement a PCIe physical layer; a first device; a second device coupled to the first device via the interconnect, wherein the second device comprises: first protocol stack circuitry to implement at least a portion of a PCIe stack, wherein the first protocol stack circuitry is to generate packets of a PCIe-based protocol and cause the packets to be sent over the PCIe physical layer; and second protocol stack circuitry to implement at least a portion of another protocol stack, wherein the other protocol stack is to implement a memory interconnect, and the second protocol stack circuitry is to generate flits of a different other protocol and cause the flits to be sent over the PCIe physical layer.
 16. The system of claim 15, wherein the first device and the second device are on a same package.
 17. The system of claim 15, wherein the first device and the second device are on a same board.
 18. The system of claim 15, wherein the different protocol comprises a non-PCIe protocol.
 19. The system of claim 15, wherein the first device comprises a processor core.
 20. The system of claim 15, wherein the second protocol stack circuitry implements two or more of a transaction layer, a protocol layer, and a link layer.
 21. The system of claim 15, wherein the other protocol comprises a memory access protocol.
 22. The system of claim 15, wherein the other protocol allows data communication at a lower latency than the PCIe-based protocol.